Introduction to Statistics:

A guide for Biologists and Ecologists
15th July 2024

Dr Philip Leftwich

Welcome!

physalia logo

About me

Associate Professor in Data Science and Genetics at the University of East Anglia.


Academic background in Behavioural Ecology, Genetics, and Insect Pest Control.


Teach Genetics, Programming, and Statistics

UEA logo

About me

Norwich - “A Fine City!”

Norwich cathedral

Norwich on a map

Introduce yourselves

Outline of the course

What to expect during this workshop

These workshops will run for 4 hours each.

We will have a 30 min break halfway through.

  • Combines slides, live coding examples, and exercises for you to participate in.

  • Ask questions in the chat throughout!

What to expect during this workshop


I hope you end up with more questions than answers after this workshop!


Schitts Creek questions gif

Source: giphy.com

What you need for this workshop

Workshop Resources



  • Slides available [here]

  • Data available [here]

  • Scripts available [here]

  • Posit Cloud workspace available [here]

An Introduction to R

What is R?

  • R is a full programming language

  • R is a calculator

  • R simulates & generates data

  • R reads, processes and manipulates data

  • R analyses data

  • R makes plots, graphics, reports and presentations

What is RStudio/Posit?

R is for Open Reproducible Research

  • Reproducible

  • Robust

  • Transparent

  • Reusable

  • Shareable research materials

Is data and stats reporting enough?

Main features of R

Comprehensive Packages: Thousands of packages available through CRAN, Bioconductor, and GitHub to extend R’s capabilities.

Data Manipulation: Powerful libraries like dplyr and tidyr for data wrangling.

Statistical Analysis: Built-in functions for a wide range of statistical tests and models.

Visualization: Excellent plotting libraries such as ggplot2 for creating stunning visualizations.

Main Features of RStudio

User-Friendly Interface: Four-pane layout (script, console, environment, files) for easy navigation.

Integrated Package Management: Simplifies the installation and management of R packages.

Version Control: Built-in support for Git to manage code versions and collaboration.

Projects: Organize workspaces efficiently using RStudio Projects.

Using Projects to Maintain a Working Environment

Why Use RStudio Projects?

Organization: Keeps related files (scripts, data, outputs) in one place.

Reproducibility: Facilitates reproducible research by maintaining a consistent environment.

Isolation: Projects can have their own libraries and settings, reducing conflicts between different projects.

Creating a Project in RStudio

Start a New Project: Go to File > New Project.

Choose a Type: Select whether to create a new directory, use an existing one, or check out a project from version control.

Maintain Structure: Follow a structured directory format (e.g., /data, /scripts, /results).

Best Practices for Projects

Regularly save and document your code.

Use version control with Git for collaboration.

Keep your working directory clean and organized.

Conclusion

Emphasise reproducibility: projects and scripts keep your analysis repeatable.

Getting Started

Encourage attendees to install R and RStudio.

Provide resources for learning R (e.g., online courses, tutorials).

Emphasize the importance of starting with projects for better organization.

Questions and Discussion

Open the floor for questions.

Discuss participants’ goals for learning R and how they envision using RStudio.

Data wrangling

Object-Oriented Programming

R is an object-oriented language. Everything in R is an object, including:

Vectors

Data frames

Lists

Objects allow you to store data and manipulate it using functions.

Creating Objects

You can create objects using the assignment operator <-:

my_data <- data.frame(name = c("Alice", "Bob"), age = c(25, 30))

This creates a data frame called my_data.

Tidy data

  • Tidy data is a standardized way of organizing data values within a dataset.
  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.

Why is Tidy Data Important?

  • Makes data easier to manipulate, analyze, and visualize.
  • Simplifies data cleaning and preparation tasks.
  • Enhances reproducibility and collaboration.

Example of Tidy Data

Name Age Height
John 23 180
Alice 25 165
Bob 22 175

Example of Non-Tidy Data

Name John Alice Bob
Age 23 25 22
Height 180 165 175

Converting Non-Tidy Data to Tidy Data

Identify variables and observations.

Restructure data so that each variable forms a column and each observation forms a row.

Example

Non-tidy (wide):

Variable John Alice Bob
Age 23 25 22
Height 180 165 175

Tidy:

Name Age Height
John 23 180
Alice 25 165
Bob 22 175

Another example

Subject Treatment_A Treatment_B
John 80 85
Alice 90 95
Bob 85 80

Tidy data for treatments

Subject Treatment Score
John A 80
John B 85
Alice A 90
Alice B 95
Bob A 85
Bob B 80

Tidy data is long

# A tibble: 3 × 3
  Subject Treatment_A Treatment_B
  <chr>         <dbl>       <dbl>
1 John             80          85
2 Alice            90          95
3 Bob              85          80
# A tibble: 6 × 3
  Subject Treatment   Score
  <chr>   <chr>       <dbl>
1 John    Treatment_A    80
2 John    Treatment_B    85
3 Alice   Treatment_A    90
4 Alice   Treatment_B    95
5 Bob     Treatment_A    85
6 Bob     Treatment_B    80
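The wide-to-long reshaping shown above can be done with tidyr’s pivot_longer(); a minimal sketch using the treatment data from the slide:

```r
library(tidyr)

# Wide data: one row per subject, one column per treatment
scores_wide <- data.frame(
  Subject     = c("John", "Alice", "Bob"),
  Treatment_A = c(80, 90, 85),
  Treatment_B = c(85, 95, 80)
)

# Collapse the two treatment columns into Treatment/Score pairs
scores_long <- pivot_longer(
  scores_wide,
  cols      = c(Treatment_A, Treatment_B),
  names_to  = "Treatment",
  values_to = "Score"
)

scores_long  # 6 rows: one per subject-treatment combination
```

pivot_wider() performs the reverse transformation if you ever need the wide layout back.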

Benefits of tidy data

  • Messy data can be messy in lots of different ways

  • Tidy data is always consistent

  • Our work as an analyst is easier if we can use consistent tools to work with our data

What is the Tidyverse?

  • The tidyverse is a collection of R packages designed for data science.

  • Developed by Hadley Wickham, it provides powerful tools for data wrangling and visualization.

  • They help make data tidy, and support manipulating and cleaning tidy data to generate insights

  • Core packages include:

    • dplyr: Data manipulation
    • tidyr: Data tidying
    • readr: Data import
    • purrr: Functional programming
    • ggplot2: Data visualization
    • tibble: Modern data frames

Loading the Tidyverse

  • Load the tidyverse with:
  library(tidyverse)

What is Data Wrangling?

Data wrangling involves:

  • Cleaning

  • Checking

  • Transforming

raw data into a usable format.

This process is essential for effective analysis and visualization.

Keeping Raw Data Raw

By keeping raw data intact:

  • Integrity: You preserve the original dataset for reference.

  • Reproducibility: Enables others to verify your work and replicate analyses.

  • Transparency: Allows tracking of changes made during the wrangling process.

Data Organization in Spreadsheets

Documenting Data Manipulations

When we have a script we are maintaining a history of data checks and manipulations:

  • Helps in understanding decisions made during analysis.

  • Supports data integrity by enabling audits and reviews.

What is dplyr?

dplyr is a key package in the tidyverse focused on data manipulation.

It provides six essential functions, known as the “Wickham Six”, for intuitive data wrangling.

The Wickham Six Functions

Function Description
select() Include or exclude specific columns.
filter() Include or exclude specific rows.
mutate() Create new columns or modify existing ones.
arrange() Change the order of rows.
group_by() Organize data into groups for analysis.
summarise() Create summary statistics for groups.

Introducing the Palmer Penguins Dataset

The palmerpenguins dataset includes data on penguin species, bill length, flipper length, and more.

It’s a great resource for practicing data wrangling in R.

To get started, load the dataset:

# IMPORT DATA ----
penguins <- read_csv("data/penguins_raw.csv")

penguins # check the data has loaded; a tibble prints its first 10 rows
#__________________________----
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Using select()

Purpose: Choose specific columns from a data frame.

Example:

penguins %>% 
  select(species, bill_length_mm, flipper_length_mm)
# A tibble: 344 × 3
   species bill_length_mm flipper_length_mm
   <fct>            <dbl>             <int>
 1 Adelie            39.1               181
 2 Adelie            39.5               186
 3 Adelie            40.3               195
 4 Adelie            NA                  NA
 5 Adelie            36.7               193
 6 Adelie            39.3               190
 7 Adelie            38.9               181
 8 Adelie            39.2               195
 9 Adelie            34.1               193
10 Adelie            42                 190
# ℹ 334 more rows

Using filter()

Purpose: Filter rows based on conditions.

Example:

penguins %>% filter(bill_length_mm > 40)
# A tibble: 242 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           40.3          18                 195        3250
 2 Adelie  Torgersen           42            20.2               190        4250
 3 Adelie  Torgersen           41.1          17.6               182        3200
 4 Adelie  Torgersen           42.5          20.7               197        4500
 5 Adelie  Torgersen           46            21.5               194        4200
 6 Adelie  Biscoe              40.6          18.6               183        3550
 7 Adelie  Biscoe              40.5          17.9               187        3200
 8 Adelie  Biscoe              40.5          18.9               180        3950
 9 Adelie  Dream               40.9          18.9               184        3900
10 Adelie  Dream               42.2          18.5               180        3550
# ℹ 232 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Using mutate()

Purpose: Add or modify columns.

Example:

 penguins %>% mutate(bill_length_cm = bill_length_mm / 10)
# A tibble: 344 × 9
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, year <int>, bill_length_cm <dbl>

Using arrange()

Purpose: Sort rows by specific columns.

Example:

penguins %>% arrange(desc(flipper_length_mm))
# A tibble: 344 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo  Biscoe           54.3          15.7               231        5650
 2 Gentoo  Biscoe           50            16.3               230        5700
 3 Gentoo  Biscoe           59.6          17                 230        6050
 4 Gentoo  Biscoe           49.8          16.8               230        5700
 5 Gentoo  Biscoe           48.6          16                 230        5800
 6 Gentoo  Biscoe           52.1          17                 230        5550
 7 Gentoo  Biscoe           51.5          16.3               230        5500
 8 Gentoo  Biscoe           55.1          16                 230        5850
 9 Gentoo  Biscoe           49.5          16.2               229        5800
10 Gentoo  Biscoe           49.8          15.9               229        5950
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Using group_by()

Purpose: Group data for summary operations.

Example:

penguins %>% group_by(species)
# A tibble: 344 × 8
# Groups:   species [3]
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Using summarise()

Purpose: Create summary statistics, often combined with group_by().

Example:

penguins %>% 
  summarise(avg_bill_length = mean(bill_length_mm, na.rm = TRUE))
# A tibble: 1 × 1
  avg_bill_length
            <dbl>
1            43.9
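To get one summary row per group, pair summarise() with group_by(); a sketch on a small illustrative data frame (the values here are made up, not from the penguins dataset):

```r
library(dplyr)

# Illustrative stand-in for the penguins data
dat <- data.frame(
  species        = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  bill_length_mm = c(39.1, NA, 47.5, 49.8)
)

res <- dat %>%
  group_by(species) %>%   # one group per species
  summarise(avg_bill_length = mean(bill_length_mm, na.rm = TRUE))

res  # one row per species
```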

Example Workflow with Palmer Penguins

Wrangle data while preserving the raw dataset:

penguins_cleaned <- penguins %>%
  filter(!is.na(bill_length_mm)) %>%
  mutate(bill_length_cm = bill_length_mm / 10) %>%
  arrange(species)

penguins_cleaned
# A tibble: 342 × 9
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           36.7          19.3               193        3450
 5 Adelie  Torgersen           39.3          20.6               190        3650
 6 Adelie  Torgersen           38.9          17.8               181        3625
 7 Adelie  Torgersen           39.2          19.6               195        4675
 8 Adelie  Torgersen           34.1          18.1               193        3475
 9 Adelie  Torgersen           42            20.2               190        4250
10 Adelie  Torgersen           37.8          17.1               186        3300
# ℹ 332 more rows
# ℹ 3 more variables: sex <fct>, year <int>, bill_length_cm <dbl>

Documenting Your Work

Always comment your code:

# Filtering out missing values and converting bill length to cm
penguins_cleaned <- penguins %>%
  filter(!is.na(bill_length_mm)) %>%
  mutate(bill_length_cm = bill_length_mm / 10) %>%
  arrange(species)

Recap and Next Steps

We introduced the tidyverse and focused on dplyr for data wrangling using the palmerpenguins dataset.

Emphasized the importance of keeping raw data raw and documenting manipulations for transparency.

Questions and Discussion

Introduction to ggplot

Why do we need to visualise data?

dataset mean_x std_dev_x
away 54.26610 16.76982
bullseye 54.26873 16.76924
circle 54.26732 16.76001
dino 54.26327 16.76514
dots 54.26030 16.76774
h_lines 54.26144 16.76590
high_lines 54.26881 16.76670
slant_down 54.26785 16.76676
slant_up 54.26588 16.76885
star 54.26734 16.76896
v_lines 54.26993 16.76996
wide_lines 54.26692 16.77000
x_shape 54.26015 16.76996

data points make the shape of a dinosaur

When we visualise the data we can see trends or patterns we might otherwise miss

Originally created by Alberto Cairo; see “Download the Datasaurus: Never trust summary statistics alone; always visualize your data”.

The problem with dynamite plots

A multitude of different patterns appear when visualising the data

These datasets all have the same mean and standard deviation

Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4): e1002128.

Exploring data

Visualisation according to statisticians

Visualization According to Statisticians: An Interview Study on the Role of Visualization for Inferential Statistics: https://arxiv.org/pdf/2309.12684.pdf

A grammar of graphics

  • ggplot2 is an application of the grammar of graphics for R
  • A default dataset and set of mappings from variables to aesthetics
  • One or more layers of geometric objects
  • One scale for each aesthetic mapping used
  • A coordinate system
  • The facet specification

A grammar of graphics, from ggplot2 as a creativity engine

Easy enough to rapidly prototype graphics at the “speed of thought”

A screenshot of a scatterplot with team rank on the x-axis and team rating on the y-axis. The teams are all trending down from rank 1 to rank 20.

Powerful enough for final “publication”

A facetted scattergraph by year for the premier league. Team rating is on the Y-axis and rank is on the x-axis. The teams overall are declining.

ggplot(data = penguins, 
       aes(x = bill_depth_mm,#<<
           y = body_mass_g))#<<

ggplot(data = penguins, 
       aes(x = bill_depth_mm,
           y = body_mass_g))+
         geom_point()#<<

ggplot(data = penguins, 
       aes(x = bill_depth_mm,
           y = body_mass_g))+
         geom_point(colour = "red")#<<

ggplot(data = penguins, 
       aes(x = bill_depth_mm,
           y = body_mass_g))+
         geom_point(aes(#<<
           colour = species))#<<

ggplot(data = penguins, 
       aes(x = bill_depth_mm,
           y = body_mass_g,
           colour = species))+#<<
         geom_point()+
  geom_smooth(method = "lm", #<<
              se = FALSE)#<<

ggplot(data = penguins, 
       aes(x = bill_depth_mm,
           y = body_mass_g,
           colour = species))+
         geom_point()+
  geom_smooth(method = "lm", 
              se = FALSE)+
  labs(x = "Bill depth (mm)",#<<
       y = "Body mass (g)")#<<

ggplot(data = penguins, 
       aes(x = bill_depth_mm,
           y = body_mass_g,
           colour = species))+
         geom_point()+
  geom_smooth(method = "lm", 
              se = FALSE)+
  labs(x = "Bill depth (mm)",
       y = "Body mass (g)")+
  scale_color_manual(
    values = c("darkolivegreen4", #<<
               "darkorchid3", #<<
               "goldenrod1"))#<<

ggplot(data = penguins, 
       aes(x = bill_depth_mm,
           y = body_mass_g,
           colour = species))+
         geom_point()+
  geom_smooth(method = "lm", 
              se = FALSE)+
  labs(x = "Bill depth (mm)",
       y = "Body mass (g)")+
  scale_color_manual(
    values = c("darkolivegreen4", 
               "darkorchid3", 
               "goldenrod1"))+
  theme_classic()#<<

Know your ABC’s

Accurate

Beautiful

Clear

Jitter

ggplot(data = penguins, 
       aes(x = species,
           y = body_mass_g,
           colour = species))+
         geom_point()+#<<
  labs(x = "Species",
       y = "Body mass (g)")+
  scale_color_manual(
    values = c("darkolivegreen4", 
               "darkorchid3", 
               "goldenrod1"))+
  theme_classic()+
  theme(legend.position = "none")

Jitter

ggplot(data = penguins, 
       aes(x = species,
           y = body_mass_g,
           colour = species))+
         geom_jitter(#<<
           width = 0.2)+#<<
  labs(x = "Species",
       y = "Body mass (g)")+
  scale_color_manual(
    values = c("darkolivegreen4", 
               "darkorchid3", 
               "goldenrod1"))+
  theme_classic()+
  theme(legend.position = "none")

Also see ggbeeswarm

Stacked bars

penguins %>%     
  group_by(species) %>% #<<
    summarise(n=n()) %>% #<<
ggplot(aes(x = " ",
           y = n,
           fill = species))+
        geom_col()+ #<<
  labs(x = " ",
       y = "Count")+
  scale_fill_manual(
    values = c("darkolivegreen4", 
               "darkorchid3", 
               "goldenrod1"))+
  theme_classic()

Dodged plots

penguins %>%     
  group_by(species) %>% 
    summarise(n=n()) %>% 
ggplot(aes(x = species, #<<
           y = n,
           fill = species))+
        geom_col()+
  labs(x = "Species",#<<
       y = "Count")+
  scale_fill_manual(
    values = c("darkolivegreen4", 
               "darkorchid3", 
               "goldenrod1"))+
  theme_classic()+
  theme(legend.position = "none")#<<

Labels

plot + 
  annotate("text",
           x = 1:3,
           y = c(156, 72, 128),
           label = c(152, 68, 124))

Labels

plot + 
 geom_label(aes(label = n),
            fill = "white",
            nudge_y = 1,
            colour = "black",
            fontface = "bold")

Facets

ggplot(penguins,
       aes(x = body_mass_g,
           fill = species))+
        geom_density(
       alpha = 0.4)+
  labs(y = "Density")+
  scale_fill_manual(
    values = c("darkolivegreen4", 
               "darkorchid3", 
               "goldenrod1"))+
  theme_classic()+
  theme(legend.position = "none")+
  facet_wrap(~species)#<<

Facets

ggplot(penguins,
       aes(x = body_mass_g,
           fill = species))+
        geom_density(
       alpha = 0.4)+
  labs(y = "Density")+
  scale_fill_manual(
    values = c("darkolivegreen4", 
               "darkorchid3", 
               "goldenrod1"))+
  theme_classic()+
  theme(legend.position = "none")+
  facet_wrap(~species)

Highlights

library(gghighlight)

ggplot(penguins,
       aes(x = body_mass_g,
           fill = species))+
        geom_density(
       alpha = 0.4)+
  gghighlight()+#<<
  labs(y = "Density")+
  scale_fill_manual(
    values = c("darkolivegreen4", 
               "darkorchid3", 
               "goldenrod1"))+
  theme_classic()+
  theme(legend.position = "none")+
  facet_wrap(~species)

Legend positions

plot +
  theme(legend.position = c(0.7,  #<<
                            0.8))  #<<

Legend positions

plot +
  theme(legend.position = "top") #<<

ggforce

geomtextpath

ggtext

Keep learning

R Cheat Sheets

Fundamentals of Data Visualisation

Beautiful plotting in R

The ggplot2 book

The ggplot extensions library

Understanding Data Quality

Importance of data quality

Missing data

The first issue we will cover is missing data. There is a whole host of reasons that your data could have missing values.

This can be broken down into three broad categories:

  • MCAR (Missing Completely at Random) The probability of missing data is independent of any observed or unobserved data.

Leaf samples lost randomly due to a labeling error in a plant growth study.

  • MAR (Missing at Random) The probability of missing data is related to some observed data but not to the missing data itself.

Blood pressure readings missing more frequently for older patients in a drug trial.

  • MNAR (Missing Not at Random) The probability of missing data is related to the value of the missing data itself.

Heavier birds in a wildlife study are harder to capture and weigh, leading to missing weight data.
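Whatever the mechanism, a first step is simply counting the missing values per column; a base-R sketch on a toy data frame (the column names are illustrative):

```r
# Toy data with some missing values
df <- data.frame(
  weight_g = c(3750, NA, 3250, NA),
  sex      = c("male", "female", NA, "male")
)

colSums(is.na(df))   # number of missing values per column
colMeans(is.na(df))  # proportion missing per column
```

Counting tells you how much is missing; deciding whether it is MCAR, MAR, or MNAR still needs knowledge of how the data were collected.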

Duplications

Duplications can skew results and reduce the reliability of our data insights.

Data types and consistency

Example: Data is treated as a factor when it should be numerical

Ensuring correct data types is essential for performing accurate analyses

Implausible values

Values that fall outside of a possible range for a given variable.

  • Categorical: “Male”, “Female”, “Mle”

  • Continuous: Negative values for an age, or a date given in the future

Outliers

Observations that fall outside of a tolerated range from other observations

  • More than 3 × the standard deviation from the mean

  • More than 1.5 × the interquartile range (IQR) beyond the quartiles

Important to identify, but crucially these are real values
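The 1.5 × IQR rule can be sketched in base R (the data here are illustrative):

```r
x <- c(2, 3, 3, 4, 4, 5, 5, 6, 40)  # 40 is a suspect observation

q     <- quantile(x, c(0.25, 0.75))  # first and third quartiles
iqr   <- q[[2]] - q[[1]]
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

x[x < lower | x > upper]  # flag (not silently delete!) values outside the fences
```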

Distributions

How values are spread across a range:

  • Normal distribution

  • Skewed distribution

  • Uniform distribution

Strategies to check and improve data quality

  • We need to have strategies for removing or correcting errors in the data.

  • R scripts give us a record of changes without affecting the integrity of the raw data

  • Having a clear series of checks saves time in the long run

Descriptive Statistics

Building Statistical Models

A model is a representation of what we think is going on in the real world. It can be a good fit, or a poor fit.

A well fitted model could be used to make predictions about what might happen in the real world.

A poorly fitted model could also make predictions but they might be disastrously wrong.

If we want our inferences to be accurate the model must represent the data collected.

Model bridges - Andy Field, Discovering Statistics

Populations

  • The total “pool” you are trying to explore/describe

  • You define what it is

  • Usually you’re not going to be able to study the entire population, because: time, money, access, resources, support, etc.

  • So instead, we can usually only take a…

Sample

  • Drawn randomly from the population…

  • …to collect observations for variable(s) of interest…

  • …to try to say something meaningful about the population.

Central tendency & spread

Often we will start with

Central tendency - a central or typical value

Data Spread - how observations are dispersed

Central tendency

The central tendency of a series of observations is a measure of the “middle value”

The three most commonly reported measures of central tendency are the sample mean, median, and mode.

Mean: the average value. Sum of the total divided by n. The most commonly reported measure; affected by outliers.

Median: the middle value (if n is odd), or the average of the two central values (if n is even). Less influenced by outliers; improves as n increases.

Mode: the most frequent value. Less commonly reported.

The mean

One of the simplest statistical models in biology is the mean

Lecturer Friends
Mark 5
Tony 3
Becky 3
Ellen 2
Phil 1
Mean 2.8
Median 3
Mode 3

Calculating the mean:

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \] \[\frac{5 + 3 + 3 + 2 + 1}{5} = 2.8\]

The mean

One of the simplest statistical models in biology is the mean

Lecturer Friends
Mark 5
Tony 3
Becky 3
Ellen 2
Phil 1
Mean 2.8
Median 3
Mode 3

We already know this is a hypothetical value as you can’t have 2.8 friends (I think?)

Now with any model we have to know how well it fits / how accurate it is

Symmetrical

If data is symmetrically distributed, the mean and median will be close, especially as n increases.

This is also known as a normal distribution

Why is data distribution important?

Understanding the shape of our data informs the summary statistics we can use

Frequency distribution with central tendencies

Asymmetrical

Depending on our question, if the data is not symmetrical this could really matter…

Assessing the fit of a central tendency

So we have multiple ways to describe a central tendency of a set of observations.

But we do not know how the data are distributed or spread around this

Data Spread

  • Range

  • Variance

  • Standard deviation

Range

Going back to our friends data:

Lecturer Friends
Mark 5
Tony 3
Becky 3
Ellen 2
Phil 1
Median 3
Range 4


  • The range is affected by extreme values

  • The interquartile range (IQR) looks at the middle 50% of scores

  • The first quartile is the median of the lower 50% of the data: 1.5

  • The second quartile is the median: 3

  • The third quartile is the median of the upper 50% of the data: 4

Range

  • Box bounded by 25th and 75th percentile (first and third quartile). This is called the interquartile range (IQR)

  • Line in box is usually the median

  • Whiskers extend to last OBSERVATION within 1 step (usually 1.5*IQR) from end of the box

  • Any observations beyond whisker are plotted as individual points

Boxplot

Box and whiskers
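The anatomy described above maps directly onto geom_boxplot(); a sketch assuming the penguins data is available (here loaded via the palmerpenguins package):

```r
library(ggplot2)
library(palmerpenguins)  # assumption: provides the penguins dataset

p <- ggplot(penguins,
            aes(x = species,
                y = body_mass_g))+
  geom_boxplot()+  # box = IQR, line = median, whiskers = within 1.5 * IQR
  labs(x = "Species",
       y = "Body mass (g)")+
  theme_classic()

p
```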

Variance

Sample variance is a more informative measure of data spread than the range because it takes every observation into account rather than just the two extremes. In words, variance is the mean squared distance of observations from the sample mean.

\[ s^2_{sample} = \frac{\sum(x - \bar{x})^2}{n - 1} \]

Calculating variance

Symmetrical sides


   value residuals squared residuals
 1 12.7      2.88         8.27 
 2 14.2      4.32        18.7  
 3 11.3      1.41         1.99 
 4  7.43    -2.44         5.93 
 5  8.27    -1.60         2.57 
 6 12.7      2.81         7.90 
 7  5.34    -4.53        20.5  
 8 13.9      4.00        16.0  

  sum of residuals sum of squares  mean
1       -0.0577        359        9.87
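The same steps can be checked against R’s built-in var(); a sketch using the friends data from earlier:

```r
x <- c(5, 3, 3, 2, 1)  # number of friends per lecturer

x_bar     <- mean(x)               # sample mean: 2.8
residuals <- x - x_bar             # distance of each value from the mean
ss        <- sum(residuals^2)      # sum of squared residuals
s2        <- ss / (length(x) - 1)  # sample variance, n - 1 denominator

s2      # 2.2
var(x)  # built-in sample variance gives the same answer
```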

N-1?

Sampling

For a population, the variance \(s^2_{pop}\) is exactly the mean squared distance of the values from the population mean \(\mu\):

\[ s^2_{pop} = \frac{\sum(x - \mu)^2}{N} \]

But calculated from a sample (using the sample mean and dividing by the sample size), this is a biased estimate of the population variance:

  • A biased sample variance will underestimate the population variance

  • Dividing by n - 1 (Bessel’s correction) corrects for this

More here

Standard Deviation

  • Square root of sample variance

  • A measure of dispersion of the sample

  • Smaller SD = more values closer to mean, larger SD = greater data spread from mean

  • Formula:

\[ s = \sqrt{\sum(x - \overline x)^2\over n - 1} \]

Standard Deviation

Small \(s\) = taller, narrower

Large \(s\) = squatter, wider

Visualising a Distribution

Histograms plot frequency/density of observations within bins

Quantile-Quantile plots plot quantiles of a dataset vs. quantiles of a theoretical (usually normal) distribution

QQ plots

If data is normally distributed, then plotting its quantiles against theoretical normal quantiles (with the same n) gives a linear relationship
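Base R draws this comparison directly with qqnorm() and qqline(); a sketch with simulated data:

```r
set.seed(42)
x <- rnorm(100, mean = 50, sd = 10)  # data simulated from a normal distribution

qqnorm(x)  # sample quantiles vs. theoretical normal quantiles
qqline(x)  # reference line; points close to it suggest normality
```

In ggplot2 the equivalents are geom_qq() and geom_qq_line().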

QQ plot examples

Right-skewed

Left-skewed

Bimodal

Heavy tailed

Light tailed

Understanding Bias

Multicollinearity

  • Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated, making it difficult to determine their individual effect on the response variable.

Impact:

  • Inflated standard errors, leading to unreliable coefficient estimates.
  • Complicated interpretation of model results.

Action:

  • Check for multicollinearity using Variance Inflation Factor (VIF) and correlation matrices.
  • Consider removing or combining correlated variables to simplify the model.

Example:

  • In a model predicting house prices, having both “square footage” and “number of rooms” may lead to multicollinearity.
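A quick first check is a correlation matrix of the numeric predictors; a sketch with simulated housing-style data (the variable names are illustrative):

```r
set.seed(1)
sqft  <- runif(50, 50, 200)               # square footage
rooms <- sqft / 25 + rnorm(50, sd = 0.5)  # strongly related to sqft
predictors <- data.frame(sqft, rooms)

cor(predictors)  # off-diagonal values near 1 or -1 flag multicollinearity
```

For a fitted model, VIFs can then be computed with, e.g., car::vif().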

Omitted Variable Bias

Definition:

  • Omitted Variable Bias occurs when a relevant variable that influences the dependent variable is left out of the model, leading to biased estimates of the included variables.

Impact:

  • Misleading conclusions about the relationships among variables.
  • Reduced model accuracy and reliability.

Action:

  • Conduct thorough literature reviews to identify all relevant variables.
  • Use domain knowledge to guide variable selection and inclusion.

Example:

  • If studying the effect of education on income without including “work experience,” the model may incorrectly attribute income differences solely to education.

Confirmation Bias

Definition:

  • Confirmation Bias is the tendency to search for, interpret, and remember information that confirms pre-existing beliefs or hypotheses, rather than seeking out contradictory evidence.

Impact:

  • Skewed analysis outcomes and reduced objectivity in decision-making.

  • Important insights or alternative explanations may be overlooked.

Action:

  • Encourage a culture of critical thinking and actively seek out opposing viewpoints.

  • Use blind analysis techniques where applicable.

Example:

  • Analysts may only report findings that support a preferred model while ignoring results that suggest alternative explanations.

Discussion

  • How can we implement strategies to recognize and mitigate these biases in our data analysis processes?

Considerations:

  • Share examples from your experiences where these biases impacted outcomes.

  • Discuss methods or tools you use to identify and address bias in your work.

Inferential Statistics

Inferential statistics

Descriptive

  • Summarises the data in a meaningful way

  • Measures of central tendency & measures of variability

Inferential

  • Analyses a sample of data to infer about a larger population

  • Includes hypothesis testing and building confidence intervals

Understanding Standard Deviation

Standard deviation is a measure of the spread of data points around the mean.

\[ s = \sqrt{\sum(x - \overline x)^2\over n - 1} \]

A low standard deviation indicates that data points are close to the mean, while a high standard deviation indicates more spread out data.

Standard deviation predicts data

The standard deviation tells you how far, on average, each score lies from the mean of your dataset

Introduction to Standard Error

Standard error measures the accuracy of a sample mean as an estimate of the population mean.

\[ SE = {s\over \sqrt n} \]
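Both formulas can be checked directly in R; a minimal sketch using a small made-up sample:

```r
# Hypothetical sample of 8 measurements
x <- c(4.1, 5.2, 6.0, 5.5, 4.8, 5.9, 6.3, 5.0)
n <- length(x)

s  <- sd(x)        # sample standard deviation (n - 1 denominator)
se <- s / sqrt(n)  # standard error of the mean

round(c(mean = mean(x), sd = s, se = se), 3)
```

Note that sd() already uses the n - 1 denominator, so no manual correction is needed.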

Standard Error

Standard deviation vs. standard error:

  • Standard deviation measures variability within a sample

  • Standard error measures variability between sample means.

Are two distributions the same or different?

Standard Error vs. Standard Error of the Difference

Standard error of the difference measures the accuracy of the difference between two sample means.

\(Mean~difference = \overline x_1 - \overline x_2\)

\(SED = \sqrt{{s_1^2\over n_1}+{s_2^2\over n_2}}\)

The Normal Distribution

Also known as the Gaussian distribution, it is a bell-shaped probability distribution

  • Symmetrical around the mean

  • Mean, median, and mode are equal

  • The shape is determined by two parameters: the mean (\(\mu\)) and standard deviation (\(\sigma\))

The Z Distribution

The Z distribution is a specific type of normal distribution with a mean of 0 and a standard deviation of 1.

  • Z scores represent the number of standard deviations a data point is from the mean

\(z = \frac{X - \mu}{\sigma}\)

  • Gives us standardized scores to compare normal distributions
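A quick sketch of standardisation in R, using a made-up sample:

```r
# Standardising a hypothetical sample to z-scores
x <- c(12, 15, 9, 14, 10, 18)
z <- (x - mean(x)) / sd(x)  # identical to scale(x)[, 1]
round(z, 2)
```

After standardising, the scores have mean 0 and standard deviation 1, whatever the original units were.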

Confidence Intervals Explained

Can be constructed by YOU:

Do you want to know the range within which you will capture the mean…

  • 68% of the time = 1 * S.E.

  • 95% = 1.96 * S.E.

  • 99% = 2.58 * S.E.

The confidence level (C) is the complement of the significance level.

\[ \alpha = 1 - C \]

The T Distribution

Under realistic scenarios we only ever have an estimate of the standard deviation.

  • At large enough sample sizes this doesn’t matter, but how large?

  • The t-distribution is a correction for small sample size and unknown population variance with a normally distributed population

The t-distribution is closely related to the normal distribution, but it is influenced by the sample size n and has fatter tails

Calculate Confidence intervals based on t-distribution

\({95\%~CI} = {\overline x \pm t*SE}\)

Calculating t based on sample size WILL affect confidence intervals
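This calculation can be reproduced by hand with qt(); a sketch on a hypothetical sample, checked against t.test():

```r
# 95% CI for a mean using the t-distribution (hypothetical sample)
x  <- c(10.2, 11.5, 9.8, 12.1, 10.9, 11.3, 10.4)
n  <- length(x)
se <- sd(x) / sqrt(n)
t_crit <- qt(0.975, df = n - 1)  # two-tailed critical t, alpha = 0.05

ci <- mean(x) + c(-1, 1) * t_crit * se
ci  # matches t.test(x)$conf.int
```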

Understanding P-Values

  • Definition of p-value: (the probability of observing data as extreme or more extreme than ours, given that the null hypothesis is true)

  • Interpretation of p-values: (low p-value indicates strong evidence against the null hypothesis)

  • Common thresholds (e.g., p < 0.05)

Relationship between p-values and Confidence Intervals

  • p-values and confidence intervals are both derived from the same underlying theory

  • a confidence interval that does not include the null hypothesis (e.g. zero difference) has a corresponding p-value

  • For example, a 95% confidence interval that does not include zero corresponds to p < 0.05

Using t-values

Observed t is the mean difference divided by the standard error of the difference

\({t=}{mean~difference\over~SE}\)

\({-2.437=}{-2.617\over1.074}\)

\({95\%~CI} = {\overline x \pm t_{critical}*SE}\)

\({95\%~CI} = {-2.617 \pm 2.048*1.07}\)

\({95\%~CI: [-4.81, -0.417]}\)

p = 0.05 when observed t is greater than critical t:

Df Critical t-value
1 12.7
5 2.57
10 2.22
20 2.08
28 2.048
30 2.04
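The critical values in this table come straight from the t quantile function, so qt() reproduces them:

```r
# Two-tailed critical t-values at alpha = 0.05 for the df in the table
df <- c(1, 5, 10, 20, 28, 30)
round(qt(0.975, df), 2)
# df = 28 gives ~2.05, matching the 2.048 used in the example above
```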

Introduction to Linear Models

Common statistical tests are linear models

Hypothesis testing

Regression

ANOVA

z and t distributions

Model summary

In this example of a simple linear model, we run the equivalent to a Student’s t-test.

The R model summary provides the formula of the regression, the estimate of the intercept and its standard error, estimated differences and uncertainty for each slope, the degrees of freedom for the whole model, the F value, and R squared

Q. What happens when we have more than two groups in our predictor variable? Why can’t we just do more t-tests?

ANalysis Of VAriance (ANOVA)

ANOVAs use information about variances, but the main goal of analysis is comparison of MEANS (don’t let the name mix you up - more on this later).

ANOVA is an omnibus test – it tests for significant differences between any means in the study

ANOVA is just a special case of the linear model

ANOVA Hypotheses:

Null Hypothesis (H0): All means are equal (in other words, all groups are from populations with the same mean)

Alternative Hypothesis (HA): At least two group means are NOT equal (that means that just two could be different, or they could ALL be different)

ANOVA Example

We have collected data for soil uranium concentrations at three locations on Los Alamos National Lab property: Site A, Site B, and Site C. The data structure is shown here:

A one-way ANOVA can be used to assess whether there is a statistically significant difference in uranium concentration in soil at three locations

A tidy dataframe illustrating one continuous dependent variable and one factor predictor variable (with three levels)

One-way ANOVA (single factor)

What does this one-way/single-factor refer to?

There is a single factor (variable), with at least 3 levels, where we are trying to compare means across the different levels of that factor.

ANOVA does this indirectly by looking at total between and within group variances as a ratio (F).

\[ SSE = \sum_{i=1}^{n}(y_i - \hat{y_i})^2 \]

\[ SSR = \sum_{i=1}^{n}(\hat{y_i} - \overline{y})^2 \]

\[ SST = \sum_{i=1}^{n}(y_i - \overline{y})^2 \]

where:

\(y_i\) = Observed value

\(\hat{y_i}\) = Value estimated by model

\(\overline{y}\) = The Grand Mean

A tidy dataframe illustrating one continuous dependent variable and one factor predictor variable (with three levels)

\(SSE + SSR = SST\)

What does an ANOVA actually do?

\[ \large F = {SSR / (k-1)\over SSE / (N-k)} = {MSR\over MSE} \]

\(k\) = Total number of groups

\(N\) = Total number of observations across all groups

\(k-1\) = numerator degrees of freedom

\(N-k\) = denominator degrees of freedom

\(MSR\) = Mean Squares Regression

\(MSE\) = Mean Squares Error

This is a ratio of the between-group variance and the within-group variance.
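The F ratio above can be computed by hand and checked against anova(); a sketch on simulated data (three made-up groups):

```r
# Reproducing the one-way ANOVA F ratio from SSR and SSE
set.seed(1)
dat <- data.frame(
  group = rep(c("A", "B", "C"), each = 10),
  y     = c(rnorm(10, 5), rnorm(10, 6), rnorm(10, 7))
)

fit   <- lm(y ~ group, data = dat)
grand <- mean(dat$y)
SSR <- sum((fitted(fit) - grand)^2)  # between-group sum of squares
SSE <- sum(residuals(fit)^2)         # within-group sum of squares
k <- 3; N <- nrow(dat)

F_manual <- (SSR / (k - 1)) / (SSE / (N - k))
all.equal(F_manual, anova(fit)$`F value`[1])
```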

F distribution

The F-value or ratio of variances, over their respective degrees of freedom will have an F-distribution.

This F distribution is used when we want to compare within and between group variances.

The curve of the distribution depends on the degrees of freedom, and it is always positively skewed

Significance testing

  • The higher the F-value the greater the signal-to-noise ratio.

  • For a given value of numerator and denominator degrees of freedom we can look up the probability of observing this ratio under a null hypothesis of identical variances.

  • If F value is high enough then we might have enough evidence to conclude that samples are likely drawn from populations with different means.


ANOVA Example

id bulb_type lifetime_hours
1 longlast_bulbs 760
2 greenlight_bulbs 1020
3 superdisco_bulbs 430
4 fluorolight_bulbs 940
5 longlast_bulbs 820
6 greenlight_bulbs 995

Call:
lm(formula = lifetime_hours ~ bulb_type, data = bulbs)

Residuals:
     Min       1Q   Median       3Q      Max 
-143.333  -35.000    5.208   26.250  136.667 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 893.33      42.26  21.140 1.25e-09 ***
bulb_typegreenlight_bulbs    97.92      55.90   1.752 0.110395    
bulb_typelonglast_bulbs    -132.08      55.90  -2.363 0.039763 *  
bulb_typesuperdisco_bulbs  -320.00      59.76  -5.355 0.000321 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 73.19 on 10 degrees of freedom
Multiple R-squared:  0.8601,    Adjusted R-squared:  0.8182 
F-statistic:  20.5 on 3 and 10 DF,  p-value: 0.0001362

Summary table

Analysis of Variance Table

Response: lifetime_hours
          Df Sum Sq Mean Sq F value    Pr(>F)    
bulb_type  3 329429  109810  20.498 0.0001362 ***
Residuals 10  53571    5357                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Post-hoc vs. Planned contrasts

Correcting for multiple comparisons

  • The F-ratio tells us only whether the model fitted to the data (SSR) accounts for more variance than other factors (SSE).

  • So if the F-ratio is large enough to be statistically significant, then we only know that one or more differences between means are statistically significant.

  • Further testing is needed!

Planned contrasts

  • You focused on a few scientifically sensible comparisons, rather than every possible comparison

  • You chose these as part of the experimental design before collecting data

  • Could structure linear model to reflect this?

Post hoc

  • Typically unplanned comparisons, conducted only after a significant ANOVA result

  • All combinations checked

  • Needs correcting for inflated Type 1 error

Post hoc

Method Equation Notes
Bonferroni/Dunn-Sidak \(p = {\alpha \over k}\) Correct the critical p for the number of independent tests
Holm’s \(p_{i} < \alpha/(k–i+1)\) Start with the smallest p-value (i = 1) and determine whether it is significant (i.e. \(p_1 < \alpha/(k–1+1) = \alpha/k\)). If so, move on to the second test. Continue in this manner until a non-significant result occurs
Tukey HSD \(t_{crit} = q * \sqrt{MSE \over N}\) Essentially a t-test correcting for multiple comparisons; q-values can be looked up for the test df and number of treatments
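The Bonferroni and Holm corrections in the table are available in base R through p.adjust(); a small sketch using made-up raw p-values:

```r
# Four hypothetical raw p-values from multiple comparisons
p <- c(0.010, 0.020, 0.030, 0.040)

p.adjust(p, method = "bonferroni")  # each p multiplied by k = 4
p.adjust(p, method = "holm")        # sequential (step-down) correction
```

p.adjust() returns adjusted p-values that can be compared directly to \(\alpha\), which is equivalent to comparing the raw p-values to a corrected critical value.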

Post hoc testing

means <- emmeans::emmeans(bulb_lsmodel0, 
                 specs = pairwise ~ bulb_type)

confint(means)
$emmeans
 bulb_type         emmean   SE df lower.CL upper.CL
 fluorolight_bulbs    893 42.3 10      799      987
 greenlight_bulbs     991 36.6 10      910     1073
 longlast_bulbs       761 36.6 10      680      843
 superdisco_bulbs     573 42.3 10      479      667

Confidence level used: 0.95 

$contrasts
 contrast                             estimate   SE df lower.CL upper.CL
 fluorolight_bulbs - greenlight_bulbs    -97.9 55.9 10   -268.9     73.1
 fluorolight_bulbs - longlast_bulbs      132.1 55.9 10    -38.9    303.1
 fluorolight_bulbs - superdisco_bulbs    320.0 59.8 10    137.2    502.8
 greenlight_bulbs - longlast_bulbs       230.0 51.8 10     71.7    388.3
 greenlight_bulbs - superdisco_bulbs     417.9 55.9 10    246.9    588.9
 longlast_bulbs - superdisco_bulbs       187.9 55.9 10     16.9    358.9

Confidence level used: 0.95 
Conf-level adjustment: tukey method for comparing a family of 4 estimates 

Complex models

Basic linear models

Basic linear models aim to describe a linear relationship between a response (outcome) variable and a predictor (input) variable, usually by the method of ordinary least squares

Straight line equation

\[ \LARGE{y_i = a + bx_i+\epsilon_i} \]

Where:

\(y_i\) is the observed value of the response variable

\(a\) is the intercept (value of y when x = 0)

\(b\) is the slope of the regression line

\(x\) is the value of the explanatory variable

\(\epsilon\) is the value of the residual error

Culmen: the ridge along the top part of a bird's bill

library(ggpubr)
penguins %>% 
  filter(species=="Adelie") %>% 
ggplot(aes(x= bill_length_mm, 
                     y= bill_depth_mm)) +
    geom_point()+
    geom_smooth(method = "lm",
                se = FALSE)+
  theme_classic()+
  stat_cor(method = "pearson", #<<
           aes(label=..r.label..))+#<<
  stat_regline_equation(label.y = 21.7)

What about an extra variable?

Now we have two reasonable predictors of bill depth:

Bill length & Body mass

# A tibble: 6 × 3
  bill_length_mm bill_depth_mm body_mass_g
           <dbl>         <dbl>       <int>
1           39.1          18.7        3750
2           39.5          17.4        3800
3           40.3          18          3250
4           NA            NA            NA
5           36.7          19.3        3450
6           39.3          20.6        3650

ANCOVA


Call:
lm(formula = bill_length_mm ~ bill_depth_mm + body_mass_g, data = .)

Coefficients:
  (Intercept)  bill_depth_mm    body_mass_g  
    23.310456       0.162859       0.004241  

Written as an equation this is:

\(\large{bill~length=23.31+0.16_{(bill~depth)}+0.004_{(body~mass)}}\)

Two-way ANOVA

  • Bill depth - a continuous quantitative variable

  • Body mass - a continuous quantitative variable

  • Penguin Species - a categorical variable with three levels

Adelie, Chinstrap, Gentoo

Dummy variables

In order to represent a categorical variable in a model we need to convert these cases to ones that are compatible with the maths.

Dummy variables are coded with 0s and 1s. With three species we would have the following:

  • Adelie[0,0] - the reference or intercept species

  • Chinstrap[1,0]

  • Gentoo[0,1]
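R builds this coding automatically for factors; model.matrix() lets you inspect it:

```r
# Inspecting the dummy coding R generates for a three-level factor
species <- factor(c("Adelie", "Chinstrap", "Gentoo"))
model.matrix(~ species)
# Adelie (the reference level) gets 0s in both dummy columns;
# Chinstrap and Gentoo each get a 1 in their own column
```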

Adelie, Chinstrap, Gentoo

Practice

lm(bill_length_mm ~ 
              bill_depth_mm +
              body_mass_g + 
              species,#<<
   data = penguins) %>% 
  broom::tidy(.)
# A tibble: 5 × 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   25.8      2.12         12.2  1.83e-28
2 bill_depth_mm  0.705    0.144         4.90 1.50e- 6
3 body_mass_g    0.00268  0.000349      7.69 1.59e-13
4 speciesGentoo -2.51     0.936        -2.68 7.70e- 3
5 speciesAdelie -9.90     0.340       -29.1  3.89e-94

Practice

\[ \large{bill~length=bill~depth+body~mass+species} \]

\(\large{bill~length=15.9+0.71_{bill~depth}+0.003_{body~mass}+9.9_{speciesChinstrap}+7.39_{speciesGentoo}}\)

If everything else about two penguins is the same (bill depth and body mass), we would expect a Chinstrap penguin to have a bill length 9.9mm longer than an Adelie penguin (on average).

If everything else about two penguins is the same (bill depth and body mass), we would expect a Gentoo penguin to have a bill length 7.39mm longer than an Adelie penguin (on average).

Q. What would be the bill length for a Gentoo penguin with a bill depth of 13.2mm and a body mass of 4500g?

46.162mm

Interactions

Why?

  • What if you think that one of the predictor variables changes the WAY that another predictor variable influences the outcome?

  • Increases our understanding of predictor variables

Why not?

  • Makes coefficient interpretation more difficult

  • Increases model complexity

Example

The species of penguin changes the relationship between bill length and bill depth (e.g. the shape of their beaks are different).

lsmodel1 <- lm(bill_length_mm ~ 
              bill_depth_mm +
              body_mass_g + 
              species + 
              species:bill_depth_mm, #<<
            data = penguins)

lsmodel1

Call:
lm(formula = bill_length_mm ~ bill_depth_mm + body_mass_g + species + 
    species:bill_depth_mm, data = penguins)

Coefficients:
                (Intercept)                bill_depth_mm  
                  13.520816                     1.397187  
                body_mass_g                speciesGentoo  
                   0.002565                     4.886060  
              speciesAdelie  bill_depth_mm:speciesGentoo  
                  10.273922                    -0.324097  
bill_depth_mm:speciesAdelie  
                  -1.097191  

Additive vs Interaction

Additive model

No Interaction effect

Interaction model

Line fitting

Model summary

lsmodel1 %>% 
  broom::tidy()
# A tibble: 7 × 5
  term                        estimate std.error statistic  p.value
  <chr>                          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)                 13.5      4.51         3.00  2.93e- 3
2 bill_depth_mm                1.40     0.255        5.49  8.15e- 8
3 body_mass_g                  0.00257  0.000349     7.34  1.58e-12
4 speciesGentoo                4.89     5.50         0.888 3.75e- 1
5 speciesAdelie               10.3      5.31         1.93  5.40e- 2
6 bill_depth_mm:speciesGentoo -0.324    0.327       -0.991 3.23e- 1
7 bill_depth_mm:speciesAdelie -1.10     0.288       -3.81  1.67e- 4

F-test between nested models

\[F = \frac{{(SSE_1 - SSE_2) / (df_1 - df_2)}}{{SSE_2 / df_2}}\] Where \(SSE_1\) is the error sum of squares of the simpler/nested model and \(SSE_2\) that of the full/original model.

\(df_1\) is the degrees of freedom for the simpler model (number of observations - number of model parameters for the simpler model).

\(df_2\) is the degrees of freedom for the more complex model (number of observations - number of model parameters for the more complex model).

F-test between nested models

\[F = \frac{{(SSE_1 - SSE_2) / (df_1 - df_2)}}{{SSE_2 / df_2}}\]

anova(lsmodel2, lsmodel1)
Analysis of Variance Table

Model 1: bill_length_mm ~ bill_depth_mm + body_mass_g + species
Model 2: bill_length_mm ~ bill_depth_mm + body_mass_g + species + species:bill_depth_mm
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    337 1822.2                                  
2    335 1729.6  2    92.677 8.9754 0.0001595 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpret an interaction term

Once we add an interaction term we can no longer interpret that “on average bill length increases by 0.29mm for every 1mm increase in bill depth”, because bill depth also shows up in an interaction term with species.

We now have to interpret this as:

For a specific species of penguin e.g. Chinstrap

\(Bill~length=23.7+0.29_{bill~depth}+0.002_{body~mass}-10.27_{Chinstrap}+1.09_{bill~depth*species}\)

“For Chinstrap penguin we expect a penguin with a bill depth of 13.2 mm to increase bill length by:”

\((0.002*body~mass)+(1*-10.2)+(0.002*1.09*1)\)

When considering interaction terms

  • Have a clear basis for why you expect an interaction (and visualise the data)

  • Know that your coefficients WILL change when you add an interaction

  • You can add the interaction term as a way to TEST the hypothesis for a significant interaction, then drop it from your model if not required.

Turning a hypothesis into a model

Turning a question into a model

  • Hypotheses in research predict relationships between variables.

  • Linear models help test these hypotheses by quantifying these relationships.

  • Converting hypotheses into linear models involves identifying dependent and independent variables.

  • Considering and including important covariates

  • Determining whether interaction terms need to be included

Simple linear model

  • Question: Does the amount of sunlight affect plant growth?

  • Hypothesis: More sunlight hours increase plant growth

  • Linear Model: height ~ sunlight_hours

Simple linear model

  • Question: Does the amount of sunlight affect plant growth?

  • Hypothesis: More sunlight hours increase plant growth

  • Control variables: Soil quality

  • Linear Model: height ~ sunlight_hours + soil_quality

Multiple predictor model

  • Question: How do water and nutrient levels affect plant growth?

  • Hypothesis: Both increased water and nutrient levels lead to increased plant growth.

  • Control Variables: Temperature

  • Linear Model: growth ~ water + nutrient + temp_c

Multiple predictor model

  • Question: How do water and nutrient levels affect plant growth?

  • Hypothesis: Both increased water and nutrient levels lead to increased plant growth.

  • Control Variables: Temperature

  • Linear Model: growth ~ water + nutrient + temp_c + water:nutrient

Exercise 1

  • Question: Does temperature affect the metabolic rate of lizards?

  • Hypothesis: Temperature increases the metabolic rate of lizards

  • Control Variables: Age of lizards

  • Linear Model: metabolic_rate ~ temp + age

Exercise 2:

  • Question: How does diet and exercise affect cholesterol in mice?

  • Hypothesis: low fat diets and regular exercise decrease cholesterol levels in mice

  • Control Variables: Genetic strain of mice

  • Linear Model: cholesterol ~ diet_fat + daily_exercise_hours + strain

  • Linear Model: cholesterol ~ diet_fat + daily_exercise_hours + strain + diet_fat:daily_exercise_hours

Exercise 3:

  • Question: Do immune system responses depend on stress and sleep in humans?

  • Hypothesis: Negative effects of stress on immune responses are lessened if sleep duration remains high

  • Control Variables: Age

  • Linear Model: immune_response ~ sleep_hours + stress + age + sleep_hours:stress

Questions?

Fitting models

How do I know my model is a good fit?

1. Statistics are no substitute for judgment

Have a clear hypothesis for why variables are being included (and whether they might interact)

2. Check assumptions & diagnostics

A model with poor diagnostics is a bad fit for the data – confidence intervals and p values will not be robust.

Model checks

  • Start by fitting a model that includes your dependent variable(s), covariates and specifies any key interactions

  • Check the model is a good fit for the observed data

  • Diagnostic plots of residuals

  • Multicollinearity

  • Omitted Variable Bias

  • Irrelevant variables

  • Hypothesis testing for model simplification

Model checks

lsmodel1 <- lm(bill_length_mm ~ 
              bill_depth_mm +
              body_mass_g + 
              species + 
              species:bill_depth_mm, #<<
            data = penguins)
library(performance)

check_model(lsmodel1, detrend = FALSE)

Linearity and heteroscedasticity

Collinearity and High Leverage

The next two plots analyze for collinearity and high leverage points.

High Leverage Points are observations that deviate far from the average. These can skew the predictions for linear models.

Collinearity is when features are highly correlated, which can throw off simple regression models. However, when we are using interaction terms - we expect these to be highly correlated with main effects.

Multicollinearity

Correlation between predictor variables in a linear model = High VIF

One pair = collinearity,

two or more = multicollinearity

Can be a result of an overspecified model (chucked everything in without thinking)

also common in observational datasets (rather than experimental datasets)

mean-centering continuous variables can reduce this

e.g. ecological data where variables may be dependent on each other

\(R^2\) may be high

BUT individual predictors may have high uncertainties and appear non-significant

Variance Inflation Factor

\(VIF_j={1\over{1-R^2_j}}\)

Each predictor \(j\) is regressed (in turn) against all of the other predictors in the model, giving \(R^2_j\)
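A sketch of this formula computed by hand for one predictor, using simulated (deliberately correlated) variables:

```r
# VIF for x1, following the formula above
set.seed(42)
x1 <- rnorm(100)
x2 <- 0.8 * x1 + rnorm(100, sd = 0.5)  # correlated with x1 by construction

r2_x1  <- summary(lm(x1 ~ x2))$r.squared  # regress x1 on the other predictor(s)
vif_x1 <- 1 / (1 - r2_x1)
vif_x1  # well above 1, flagging collinearity
```

In practice the performance package used earlier reports this for every predictor via check_collinearity().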

Normality of Residuals

If the distributions are skewed, this can indicate problems with the model.

Quantile-Quantile Plot: We can see that several points towards the end of the quantile plot do not fall along the straight-line.

Normal density Plot: The residuals fit a very nice normal distribution

Assumptions not met?

What if our data does not satisfy one or more of these assumptions?

Data transformations

If our assumptions are violated, then depending on the issue, estimates of the mean or uncertainty intervals may be biased, affecting our ability to determine statistical significance.

Transformations can help with:

  1. Heteroscedasticity (non-homogeneity) of variance

  2. Non-linearity

  3. Non-normal residual variance

  4. Outliers (sometimes) - might be possible to remove these

lm(sqrt(y) ~ x)

lm(log(y) ~ x)

Choosing a transformation

MASS::boxcox(lsmodel1)

Lambda value \(\lambda\) Transformed data (Y)
-2 Y^-2
-1 Y^-1
-0.5 Y^-0.5
0 log(Y)
0.5 sqrt(Y)
1 Y^1
2 Y^2

Omitted variable bias

Bias created when one or more predictor variables are incorrectly left out of a model

e.g. this predictor variable should have been included because you could reasonably expect it to influence the outcome.

  • Other terms may have their influence over- or underestimated (it is not always clear which way round this will occur).

  • Standard errors and measures of uncertainty are wrong

  • Significance tests are biased

  • Predictions will be wrong

Irrelevant variables

The other side of the coin – Including irrelevant variables

  • Estimates will be unbiased but perhaps inefficient (greater variances, SE, CI)

  • Variances will be unbiased (but maybe not optimal for hypothesis testing)

  • Greater sample size can help reduce inconsistency and selection bias issues

  • Prediction will be unchanged

So: it’s better to INCLUDE an irrelevant predictor variable than to OMIT a relevant one if you aren’t sure.

Model fit checklist

Think HARD about what terms should be included in a model to test a hypothesis

  1. Which terms is it reasonable to test might interact with each other to affect y?

  2. Which terms might be showing signs of collinearity?

  3. Visualise your data

  4. Fit your most complex model (main terms and interactions)

  1. Check the model fit/diagnostics

    • Homogeneity of variance,

    • normal distribution of residuals,

    • collinearity

  2. Refit model (change terms, data transformation if necessary)

  3. Test removal of interaction terms

  4. Leave all main predictors in the model/refine model further.

Stepwise-removal

  • Stepwise regression is a popular tool for simplifying models and removing non-significant predictors

  • However standard statistical tests do not account for sequential testing

  • models can have biased (too small) p-values and overly narrow confidence intervals

  • I recommend restricting this to testing whether interaction terms are required

Step away from stepwise

Testing whether interactions can be removed

\(R^2\) almost always increases as you add more terms and interactions. This does not increase your ability to test predictors.

The adjusted \(R^2\) may be a more reliable measure.

anova() or drop1() both produce likelihood ratio F-tests when comparing nested models.

\(F = \frac{(SSE_{reduced} - SSE_{full}) / (df_{reduced} - df_{full})}{SSE_{full} / df_{full}}\)

\(F = \frac{\Delta SSE / \Delta df}{MSE_{full}}\)

Types of sums of squares

There are three types of sums of squares I, II & III

By default anova() implements type I

When data are balanced these will all be the same. For unbalanced designs they may differ.

drop1()

The drop1() function compares all possible models that can be constructed by dropping a single model term. As such it produces the most robust sums of squares for F-tests with an unbalanced model design.

drop1(lsmodel1, test = "F")
Single term deletions

Model:
bill_length_mm ~ bill_depth_mm + body_mass_g + species + species:bill_depth_mm
                      Df Sum of Sq    RSS    AIC F value    Pr(>F)    
<none>                             1729.6 568.32                      
body_mass_g            1   278.504 2008.1 617.38 53.9437 1.583e-12 ***
bill_depth_mm:species  2    92.677 1822.2 582.17  8.9754 0.0001595 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note it drops all terms that can be dropped e.g. species cannot be dropped here, because it is in the interaction term.

Equivalent to anova(reduced model, full model)

drop1()

lsmodel1a <- lm(bill_length_mm ~ bill_depth_mm + body_mass_g + species,
            data = penguins)

drop1(lsmodel1a, test = "F")
Single term deletions

Model:
bill_length_mm ~ bill_depth_mm + body_mass_g + species
              Df Sum of Sq    RSS     AIC F value    Pr(>F)    
<none>                     1822.2  582.17                      
bill_depth_mm  1     129.7 1952.0  603.69  23.992 1.504e-06 ***
body_mass_g    1     320.1 2142.3  635.51  59.194 1.591e-13 ***
species        2    4714.8 6537.1 1015.05 435.976 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If an interaction is dropped from a full model, then drop1() could be re-run. This can be done to simplify the model, or just accurately acquire F values only for the main effects also involved in an interaction.

Reporting

Any estimates or confidence intervals should come from full model.

gtsummary::tbl_regression(lsmodel1)

Characteristic Beta 95% CI1 p-value
bill_depth_mm 1.4 0.90, 1.9 <0.001
body_mass_g 0.00 0.00, 0.00 <0.001
species
    Chinstrap — —
    Gentoo 4.9 -5.9, 16 0.4
    Adelie 10 -0.18, 21 0.054
bill_depth_mm * species
    bill_depth_mm * Gentoo -0.32 -0.97, 0.32 0.3
    bill_depth_mm * Adelie -1.1 -1.7, -0.53 <0.001

1 CI = Confidence Interval

AIC

  • The Akaike Information Criterion (AIC) balances the trade-off between model fit and complexity by penalizing models with more parameters.

  • It aims to find the model that best explains the data while avoiding overfitting.

  • AIC is calculated as follows: \(\text{AIC} = 2k - 2\ln(\hat{L})\),

  • where: \(\hat{L}\) is the maximized value of the likelihood function for the fitted model and \(k\) is the number of parameters in the model.

  • In general a difference in AIC of <2 suggests equivalent goodness-of-fit
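The formula can be verified against R's built-in AIC(), here on the built-in cars dataset:

```r
# Reproducing AIC from the log-likelihood
fit <- lm(dist ~ speed, data = cars)

ll <- logLik(fit)
k  <- attr(ll, "df")  # parameter count (intercept, slope, residual variance)
aic_manual <- 2 * k - 2 * as.numeric(ll)

all.equal(aic_manual, AIC(fit))
```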

AIC

AIC can:

  • Compare non-nested models fitted to the same dataset and response variable

  • Compare across different GLM families and link functions

AIC cannot:

  • Compare models where the response variable is transformed

  • Compare models where there are differences in the dataset

AIC(lsmodel1, lsmodel1a)
          df      AIC
lsmodel1   8 1540.871
lsmodel1a  6 1554.723

AIC

drop1(lsmodel1, test = "F")
Single term deletions

Model:
bill_length_mm ~ bill_depth_mm + body_mass_g + species + species:bill_depth_mm
                      Df Sum of Sq    RSS    AIC F value    Pr(>F)    
<none>                             1729.6 568.32                      
body_mass_g            1   278.504 2008.1 617.38 53.9437 1.583e-12 ***
bill_depth_mm:species  2    92.677 1822.2 582.17  8.9754 0.0001595 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model dredging

library(MuMIn)

model <- lm(bill_length_mm ~ bill_depth_mm * body_mass_g * species,
            data = penguins1, na.action = "na.fail")

model_dredge <- dredge(model)

head(model_dredge, n = 3)
Global model call: lm(formula = bill_length_mm ~ bill_depth_mm * body_mass_g * species, 
    data = penguins1, na.action = "na.fail")
---
Model selection table 
   (Int) bll_dpt_mm bdy_mss_g spc bll_dpt_mm:bdy_mss_g bll_dpt_mm:spc
24 48.26     -1.397 -0.002565   +                                   +
32 48.26     -1.385 -0.002565   +           -6.338e-05              +
56 47.58     -1.590 -0.001623   +                                   +
   bdy_mss_g:spc df   logLik   AICc delta weight
24                8 -762.436 1541.3  0.00  0.619
32                9 -762.353 1543.2  1.94  0.234
56             + 10 -761.754 1544.2  2.87  0.147
Models ranked by AICc(x) 

Model averaging

model.avg(model_dredge)
Estimate Std. Error t p lower95 upper95
(Intercept) 48.13 0.63 76.26 0.00 46.90 49.37
bill_depth_mm -1.43 0.28 -5.01 0.00 -1.98 -0.87
body_mass_g 0.00 0.00 -3.67 0.00 0.00 0.00
speciesGentoo -0.61 1.23 -0.50 0.31 -3.03 1.80
speciesAdelie -8.36 0.71 -11.77 0.00 -9.76 -6.97
bill_depth_mm:speciesGentoo 0.37 0.40 0.92 0.18 -0.41 1.15
bill_depth_mm:speciesAdelie 1.15 0.33 3.49 0.00 0.50 1.79
bill_depth_mm:body_mass_g 0.00 0.00 -0.21 0.42 0.00 0.00
body_mass_g:speciesGentoo 0.00 0.00 -0.29 0.39 0.00 0.00
body_mass_g:speciesAdelie 0.00 0.00 -0.35 0.36 0.00 0.00
bill_depth_mm:body_mass_g:speciesGentoo 0.00 0.00 0.14 0.45 0.00 0.00
bill_depth_mm:body_mass_g:speciesAdelie 0.00 0.00 0.03 0.49 0.00 0.00

Issues with Model averaging

  • Multi-collinearity: When predictors are highly correlated, the AIC weights can become unreliable. Multi-collinearity can lead to inflated standard errors and unstable coefficient estimates, which affects the calculation of AIC weights.

  • AIC model weights apply to an entire model, not to the individual predictor variables within it; they therefore carry little information about the contribution of any single predictor to the predicted response

  • Does model averaging make sense?

  • Model averaging and muddled multimodel inferences

Generalized Linear Models

Recapping general linear models

  • Also known as the Ordinary Least Squares (OLS) model

  • It aims to fit a regression line which minimises the squared differences between observed and predicted values

ordinary least squares

Equation of the line

The equation of a linear model (lm) is given by:

\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]

where:

  • \(y_i\) is the observed value of the response variable for observation \(i\).

  • \(\beta_0\) is the intercept.

  • \(\beta_1\) is the slope coefficient, representing the change in the response variable for a one-unit change in the explanatory variable.

  • \(x_i\) is the value of the explanatory variable.

  • \(\epsilon_i\) represents the residuals of the model
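The terms of this equation map directly onto the output of lm(). A minimal sketch on simulated data, where the true values \(\beta_0 = 2\) and \(\beta_1 = 3\) are chosen purely for illustration:

```r
# Sketch: recovering beta0 and beta1 from simulated data (true values are illustrative)
set.seed(42)
x <- runif(50, 0, 10)
y <- 2 + 3 * x + rnorm(50, sd = 1)   # y_i = beta0 + beta1 * x_i + epsilon_i
fit <- lm(y ~ x)
coef(fit)                            # estimates should be close to 2 and 3
head(resid(fit))                     # the estimated epsilon_i terms
```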

Limitations of general linear models

Q. What are the assumptions of a general linear model?

  • Assumes a linear relationship

  • Assumes normal distribution of the residuals

  • Assumes homogeneity of variance (homoscedasticity) of the residuals

  • Independence - assumes observations are independent of each other

Normal distribution of the residuals

The linear regression line is the most likely line given your data if we assume each data point comes from a hypothetical bell curve centered on the regression line

probability

Normal distribution

Normal distribution of the residuals

When using our residuals to calculate standard error, confidence intervals and statistical significance these are assumed to be drawn from:

  • a normal distribution with mean zero

  • with a constant variance.


This implies that the residuals, or the distances between the observed values and the values predicted by the linear model, can be modeled by drawing random values from a normal distribution.

The linear model equation

Another way to write the lm equation is:

\[ y_i \sim N(\mu = \beta_0 + \beta_1 X_i, \sigma^2) \]

Which literally means that \(y_i\) is drawn from a normal distribution with parameters:

  • \(\mu\) (which depends on \(x_i\))

  • \(\sigma\) (which has the same value for all measures of \(Y\)).

Real data

How often do these assumptions really hold true?

A good fit:

Real data

How often do these assumptions really hold true?

An ok fit:

Real data

How often do these assumptions really hold true?

A poor fit:

Assumption testing

Residuals vs Fitted

  • Purpose: To check for linearity and homoscedasticity.

  • Interpretation: Look for a random scatter of points around the horizontal line at y = 0. A funnel-shaped pattern suggests heteroscedasticity.

QQ-plot

  • Purpose: To assess the normality of the residuals

  • Interpretation: Points should fall along the diagonal line. Deviation from the line indicates non-normality of residuals.

Scale-location

  • Purpose: To check homoscedasticity and identify outliers

  • Interpretation: Uses standardised residuals. Constant spread indicates homoscedasticity

Residuals vs. Leverage

  • Purpose: To identify influential data points (outliers)

  • Interpretation: Looking for points with high leverage that might unduly affect the regression line. Investigate points with a Cook's distance > 0.5
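All four diagnostic plots described above come from base R's plot() method for fitted models. A minimal sketch using a toy lm():

```r
# Sketch: the four standard diagnostic plots for any fitted lm (toy model)
fit <- lm(mpg ~ wt, data = mtcars)
par(mfrow = c(2, 2))   # 2x2 grid
plot(fit)              # Residuals vs Fitted, QQ, Scale-Location, Residuals vs Leverage
par(mfrow = c(1, 1))
```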

Performance

The performance package from easystats can produce (among other things) similar plots.

Formal tests

  • For linear models we can also run some formal tests on the residuals
# lmtest::bptest(model)

performance::check_normality(model)
OK: residuals appear as normally distributed (p = 0.825).
# shapiro.test(residuals(model))

performance::check_heteroscedasticity(model)
OK: Error variance appears to be homoscedastic (p = 0.428).
  • This becomes more difficult with generalised models

  • Being able to interpret residual plots is important

Generalised Linear Models

What are Generalised linear models?

  • The Generalised Linear Model (GLM) is an extension of the ordinary linear regression model that accommodates a broader range of response variable types and error distributions.

  • Key components:

  • Linear Predictor: Combines predictor variables linearly.

  • Link Function: Links the mean of the response variable to the linear predictor.

  • Error family: Drawn from the exponential family it determines the shape of the response variable’s distribution.

Linear Predictor

Linear models:

\(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \ldots + \beta_n x_{ni} + \epsilon_i\)


Then generalised linear models…

\(\eta_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \ldots + \beta_n x_{ni}\)

  • \(\eta_i\) the linear prediction for observation \(i\)

Error family

Let’s start with linear models…

\(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \ldots + \beta_n x_{ni} + \epsilon_i\)

\(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\)

Then generalised linear models…

\(\mu_i = f(\eta_i)\)

\(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\)

  • \(\epsilon_i\) is the error term for observation \(i\)

Examples: Gaussian (normal), binomial, Poisson, gamma, etc.

Generalised Linear Model

Linear Predictor (\(\eta\)):

\(\eta_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \ldots\)

Link Function (\(g\)):

\(g(\mu_i) = \eta_i\)

Probability Distribution:

\(\mu_i = f(\eta_i)\)

\(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\)

  • \(\eta_i\) linear prediction for observation \(i\)

  • \(g\) the link function relating the linear predictor to the expected value (mean) of the response variable

  • \(\mu_i\) is the expected value of the response variable for observation \(i\)

  • \(f\) the inverse of the link function, transforming the linear predictor back to the response variable’s expected value

  • \(\epsilon_i\) is the error term for observation \(i\)
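These pieces can be seen directly in R: predict() returns either the linear predictor \(\eta\) or the mean \(\mu\), and the inverse link maps one onto the other. A minimal sketch with a toy Poisson GLM (the data are made up for illustration):

```r
# Sketch: linear predictor (eta) vs mean (mu) in a GLM (toy Poisson, log link)
counts <- c(2, 4, 3, 7, 9, 12)
x <- 1:6
fit <- glm(counts ~ x, family = poisson(link = "log"))
eta <- predict(fit, type = "link")       # eta_i = g(mu_i)
mu  <- predict(fit, type = "response")   # mu_i  = f(eta_i)
all.equal(unname(mu), unname(exp(eta)))  # for the log link, f = exp
```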

Changing model structures

There are a range of model structures available.

The choice of error family and link function influences how well the model fits

Exponential Error Family Link Function
Gaussian Identity, sqrt, log, inverse
Binomial Logit, probit, cloglog
Poisson Log, identity, sqrt
Gamma Inverse, identity, log, sqrt
Negative Binomial Log, identity, sqrt

Choosing an error family

  • What type of variable is the outcome?

  • Fit models: check residuals and AICs

physalia logo

Questions?

A Linear Model Example

Janka timber hardness

This regression analysis uses data from the Australian forestry industry recording wood density and timber hardness from 36 different samples

  • Timber hardness is quantified on the “Janka” scale

  • The ‘amount of force required to embed a 0.444” steel ball into the wood to half of its diameter’.

Fitting the general linear model

janka_ls <- lm(hardness ~ dens, data = janka)

summary(janka_ls)
janka |> 
  ggplot(aes( x = dens, y = hardness))+
  geom_point()+
  geom_smooth(method = "lm")

Call:
lm(formula = hardness ~ dens, data = janka)

Residuals:
    Min      1Q  Median      3Q     Max 
-338.40  -96.98  -15.71   92.71  625.06 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1160.500    108.580  -10.69 2.07e-12 ***
dens           57.507      2.279   25.24  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 183.1 on 34 degrees of freedom
Multiple R-squared:  0.9493,    Adjusted R-squared:  0.9478 
F-statistic:   637 on 1 and 34 DF,  p-value: < 2.2e-16

Challenge: Fit a Linear Model



  • Fit the Linear Model lm(hardness ~ dens, data = janka) to the janka data.

  • Evaluate the model fit

10:00

Model diagnostics

plot(janka_ls, which=2)
plot(janka_ls, which=3)

Violations of assumptions

library(lmtest)

# Breusch-Pagan Test of Homoscedasticity
lmtest::bptest(janka_ls)

    studentized Breusch-Pagan test

data:  janka_ls
BP = 4.4668, df = 1, p-value = 0.03456
# Shapiro-Wilk test for normality of residuals
shapiro.test(residuals(janka_ls))

    Shapiro-Wilk normality test

data:  residuals(janka_ls)
W = 0.93641, p-value = 0.03933

Challenge



  • What can we try to improve the fit of our lm(hardness ~ dens, data = janka) ?


  • Let’s make suggestions then try them out:
15:00

BoxCox

The R output for the MASS::boxcox() function plots a maximum likelihood curve (with a 95% confidence interval - drops down as dotted lines) for the best transformation for fitting the dependent variable to the model.

lambda value transformation
0 log(Y)
0.5 sqrt(Y)
1 Y (no transformation)
2 Y^2
MASS::boxcox(janka_ls)

Square root transformation

janka_sqrt <- lm(sqrt(hardness) ~ dens, data = janka)

plot(janka_sqrt, which=2)
plot(janka_sqrt, which=3)

Model fit

ggplot(janka, aes(x = hardness, y = dens)) +
  geom_point() +  # scatter plot of original data points
  geom_smooth(method = "lm", formula = y ~ sqrt(x)) +  # regression line
  labs(title = "Sqrt Linear Regression with ggplot2",
       x = "Hardness", y = "Density")  # axis labels and title

Log transformation

janka_log <- lm(log(hardness) ~ dens, data = janka)

plot(janka_log, which=2)
plot(janka_log, which=3)

Model fit

ggplot(janka, aes(x = hardness, y = dens)) +
  geom_point() +  # scatter plot of original data points
  geom_smooth(method = "lm", formula = y ~ log(x)) +  # regression line
  labs(title = "Log Linear Regression",
       x = "Hardness", y = "Density")  # axis labels and title

Polynomials

janka_poly <- lm(log(hardness) ~ poly(dens, 2), data = janka)

plot(janka_poly, which=2)
plot(janka_poly, which=3)

Polynomial fit

summary(janka_poly)
ggplot(janka, aes(x = hardness, y = dens)) +
  geom_point() +  # scatter plot of original data points
  geom_smooth(method = "lm", formula = y ~ poly(log(x), 2)) +  # regression line
  labs(title = "Quadratic Log Linear Regression",
       x = "Hardness", y = "Density")  # axis labels and title

Call:
lm(formula = log(hardness) ~ poly(dens, 2), data = janka)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.22331 -0.05708 -0.01104  0.07500  0.18871 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      7.1362     0.0168 424.896  < 2e-16 ***
poly(dens, 2)1   3.3862     0.1008  33.603  < 2e-16 ***
poly(dens, 2)2  -0.5395     0.1008  -5.354 6.49e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1008 on 33 degrees of freedom
Multiple R-squared:  0.9723,    Adjusted R-squared:  0.9706 
F-statistic: 578.9 on 2 and 33 DF,  p-value: < 2.2e-16

Weighted least squares

janka_wls <- lm(sqrt(hardness) ~ dens, weights = 1/sqrt(hardness), data = janka)
  • Weights are specified as \(\frac{1}{\sqrt{\text{hardness}}}\), indicating that observations with higher hardness values will have lower weights, and vice versa.

  • This weighting scheme can assign more importance to certain observations in the model fitting process.

  • This approach is known as weighted least squares (WLS) regression, where observations are weighted differently based on certain criteria, such as the square root of hardness.

Weighted least squares

janka_wls <- lm(sqrt(hardness) ~ dens, weights = 1/sqrt(hardness), data = janka)

plot(janka_wls, which=2)
plot(janka_wls, which=3)

Model fit


Call:
lm(formula = sqrt(hardness) ~ dens, data = janka, weights = 1/sqrt(hardness))

Weighted Residuals:
     Min       1Q   Median       3Q      Max 
-0.55072 -0.19009 -0.03393  0.19082  0.63845 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.01819    0.96647   2.088   0.0443 *  
dens         0.76142    0.02201  34.601   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.294 on 34 degrees of freedom
Multiple R-squared:  0.9724,    Adjusted R-squared:  0.9716 
F-statistic:  1197 on 1 and 34 DF,  p-value: < 2.2e-16

Troublesome transformations

  • Changes interpretability

  • Can introduce NEW violations of assumptions

  • Error and Mean change together

  • We don’t understand the true structure of our data

A Generalised Linear Model Example

A Generalised linear model approach

  • Fit a glm(gaussian(link = "identity")) to the data

  • Check that the model is identical to the lm()

05:00
janka_glm <- glm(hardness ~ dens, data = janka, family = gaussian(link = "identity"))

summary(janka_glm)

Call:
glm(formula = hardness ~ dens, family = gaussian(link = "identity"), 
    data = janka)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1160.500    108.580  -10.69 2.07e-12 ***
dens           57.507      2.279   25.24  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 33510.78)

    Null deviance: 22485041  on 35  degrees of freedom
Residual deviance:  1139366  on 34  degrees of freedom
AIC: 481.21

Number of Fisher Scoring iterations: 2

A Generalised linear model approach

We previously identified that the square root of the dependent variable might have been a better fit with our linear model:

janka_glm <- glm(hardness ~ dens, data = janka, family = gaussian(link = "sqrt"))

summary(janka_glm)

Call:
glm(formula = hardness ~ dens, family = gaussian(link = "sqrt"), 
    data = janka)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.60869    1.47660   1.767   0.0863 .  
dens         0.75194    0.02722  27.622   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 25498.81)

    Null deviance: 22485041  on 35  degrees of freedom
Residual deviance:   866960  on 34  degrees of freedom
AIC: 471.38

Number of Fisher Scoring iterations: 4

Plot model

ggplot(janka, aes(x = dens, y = hardness)) +
  geom_point() +  # scatter plot of original data points
  geom_smooth(method = "glm", method.args = list(family = gaussian(link = "sqrt"))) +  # regression line
  labs(title = "Gaussian GLM with sqrt link",
       x = "Density", y = "Hardness")  # axis labels and title

Residuals

janka_glm <- glm(hardness ~ dens, data = janka, family = gaussian(link = "sqrt"))

plot(janka_glm, which=2)
plot(janka_glm, which=3)

Choosing an Error Family

library(fitdistrplus)

descdist(janka$hardness, boot = 500)

summary statistics
------
min:  413   max:  3260 
median:  1195 
mean:  1469.472 
estimated sd:  801.5172 
estimated skewness:  0.6579162 
estimated kurtosis:  2.571426 
  • A Cullen and Frey plot (skewness-kurtosis)

  • A graphical view of the distribution of the dependent variable

  • Reference lines represent expected skew and kurtosis for different distributions

  • Beta family sits outside of GLM exponential families

Choosing an Error family

library(fitdistrplus)

par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
fg <- fitdist(janka$hardness, "gamma")
fn <- fitdist(janka$hardness, "norm")
plot.legend <- c("normal", "gamma")
denscomp(list(fn, fg), legendtext = plot.legend)
qqcomp(list(fn, fg), legendtext = plot.legend)

Choosing based on AIC

summary(fg)
Fitting of the distribution ' gamma ' by maximum likelihood 
Parameters : 
         estimate   Std. Error
shape 3.351431971 0.4076228722
rate  0.002280106 0.0002166559
Loglikelihood:  -287.9889   AIC:  579.9778   BIC:  583.1448 
Correlation matrix:
          shape      rate
shape 1.0000000 0.7201126
rate  0.7201126 1.0000000
summary(fn)
Fitting of the distribution ' norm ' by maximum likelihood 
Parameters : 
      estimate Std. Error
mean 1469.4722  131.71673
sd    790.3066   93.13779
Loglikelihood:  -291.2889   AIC:  586.5779   BIC:  589.7449 
Correlation matrix:
     mean sd
mean    1  0
sd      0  1

Gamma

\(f(x;\alpha,\beta) = \frac{\beta^\alpha x^{\alpha - 1} e^{-\beta x}}{\Gamma(\alpha)}\)

  • Shape (α): Determines the overall shape of the distribution. The gamma distribution is strongly right-skewed for small values of α and becomes less skewed and more symmetric as α increases.

  • Rate (β): Influences the spread of the distribution. Smaller values of β result in distributions that are more spread out, larger values of β lead to distributions that are less spread out.

  • Useful for modelling positively skewed continuous non-negative data
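The mean and variance follow directly from these two parameters (\(\text{mean} = \alpha/\beta\), \(\text{variance} = \alpha/\beta^2\)). A quick sketch, plugging in the fitdist estimates shown above:

```r
# Sketch: gamma mean, variance, and skewness from shape and rate
# (estimates taken from the fitdist output above)
shape <- 3.351432
rate  <- 0.002280106
shape / rate      # ~1470, close to the sample mean of hardness (1469.5)
shape / rate^2    # the implied variance
2 / sqrt(shape)   # skewness: shrinks as shape grows
```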

Gamma GLM

janka_gamma <- glm(hardness ~ dens, data = janka, family = Gamma(link = "sqrt"))

summary(janka_gamma)

Call:
glm(formula = hardness ~ dens, family = Gamma(link = "sqrt"), 
    data = janka)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.86729    0.90357   2.067   0.0465 *  
dens         0.76780    0.02243  34.224   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 0.009716208)

    Null deviance: 11.26788  on 35  degrees of freedom
Residual deviance:  0.32876  on 34  degrees of freedom
AIC: 452.97

Number of Fisher Scoring iterations: 4

Plot model

ggplot(janka, aes(x = dens, y = hardness)) +
  geom_point() +  # scatter plot of original data points
  geom_smooth(method = "glm", method.args = list(family = Gamma(link = "sqrt"))) +  # regression line
  labs(title = "Gamma GLM with sqrt link",
       x = "Density", y = "Hardness")  # axis labels and title

Exercise: Gamma

  • Fit the Gamma distribution linear model to the data.

  • glm(hardness ~ dens, data = janka, family = Gamma(link = "sqrt"))

  • Evaluate the model fit against other glm fits.

10:00

Questions?

Likelihood

Log-Likelihood

  • Likelihood measures the probability of observing the given data under a specific statistical model, as a function of the model’s parameters.

  • It represents how well the model explains the observed data.

  • The log likelihood function in maximum likelihood estimations is usually computationally simpler as it is additive.

The log-likelihood (l) maximum is the same as the likelihood (L) maximum.

Log-Likelihood

Likelihood Function:

\(L(\theta | y) = f(y_1 | \theta) \times f(y_2 | \theta) \times \ldots \times f(y_n | \theta)\)


Log-Likelihood Function:

\(\ell(\theta | y) = \log(f(y_1 | \theta)) + \log(f(y_2 | \theta)) + \ldots + \log(f(y_n | \theta))\)


Although log-likelihood functions are mathematically easier than their multiplicative counterparts, they can be challenging to calculate by hand. They are usually calculated with software.
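For a normal model the log-likelihood can nonetheless be written out by hand with dnorm() and compared against R's logLik(). A minimal sketch (note lm() reports the maximum-likelihood variance, dividing by \(n\) rather than \(n-1\)):

```r
# Sketch: log-likelihood of an intercept-only normal model by hand vs logLik()
set.seed(1)
y <- rnorm(20, mean = 5, sd = 2)
mu_hat    <- mean(y)
sigma_hat <- sqrt(sum((y - mu_hat)^2) / length(y))   # ML estimate (divide by n)
ll_hand   <- sum(dnorm(y, mean = mu_hat, sd = sigma_hat, log = TRUE))
all.equal(ll_hand, as.numeric(logLik(lm(y ~ 1))))    # identical
```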

Maximum Likelihood Estimation

For a specified error family distribution and link function, the glm finds parameter values that have the highest likelihood score.

  • Which estimates are most probable for producing the observed data

  • This is an iterative process, without a single defined equation

  • For the Gaussian family with an identity link it will produce the same outcome as OLS.

Log-Likelihood Function: \[\ell(\theta | \mathbf{y}) = \sum_{i=1}^{n} \log(f(y_i | \theta))\]

Maximum Likelihood

  • A contour plot visualizing the log-likelihood surface for a normal distribution fitted to simulated data.

  • The contour lines on the plot represent regions of equal log-likelihood values. Darker regions indicate lower log-likelihood values, while lighter regions indicate higher log-likelihood values

Exercise: log-likelihood



  • Use the log-likelihood function with varying values of the mean and sd
05:00

Deviance

  • The deviance is a key concept in Generalised linear models.

  • It measures the discrepancy between the fitted Generalised linear model and a perfect (saturated) model for the sample.

Pseudo R-squared

Formula:

\(GLM~R^2 = 1 - \frac{D}{D_0}\)

\(linear\_model~R^2= 1 - \frac{SSE}{SST}\)

  • \(D\) is the deviance of the fitted model.

  • \(D_0\) is the deviance of the null model (linear model with only an intercept).

The GLM \(R^2\) is not the percentage of variance explained by the model, but rather a ratio indicating how close the fit is to being perfect.
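The deviance-based ratio can be computed directly from a fitted glm object, since both deviances are stored on it. A minimal sketch with a toy Poisson fit (this is the \(1 - D/D_0\) ratio above; r.squaredLR() uses a likelihood-ratio-based formula, so its numbers can differ slightly):

```r
# Sketch: deviance-based pseudo R-squared from a fitted glm (toy Poisson)
counts <- c(2, 4, 3, 7, 9, 12)
x <- 1:6
fit <- glm(counts ~ x, family = poisson)
pseudo_r2 <- 1 - fit$deviance / fit$null.deviance
pseudo_r2
```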

Pseudo R-squared in R

  • We can fit the deviance-based pseudo \(R^2\) with the MuMIn package

  • r.squaredLR() is a function used to compute the likelihood ratio (LR) based R-squared for a given model

library(MuMIn)

r.squaredLR(janka_gamma)
[1] 0.9722591
attr(,"adj.r.squared")
[1] 0.9722593

AIC

  • The Akaike Information Criterion (AIC) balances the trade-off between model fit and complexity by penalizing models with more parameters.

  • It aims to find the model that best explains the data while avoiding overfitting.

  • AIC is calculated as follows: \(\text{AIC} = 2k - 2\ln(\hat{L})\),

  • where: \(\hat{L}\) is the maximized value of the likelihood function for the fitted model and \(k\) is the number of parameters in the model.

  • In general, a difference in AIC of <2 suggests equivalent goodness-of-fit

AIC

AIC can:

  • Compare non-nested models fitted to the same dataset and response variable

  • Compare across different GLM families and link functions

AIC cannot:

  • Compare models where the response variable is transformed

  • Compare models where there are differences in the dataset

Exercise: Compare models

  • Revisit the janka models

  • Compare R-squared fits and AIC values

  • r.squaredLR() and AIC()

10:00

Likelihood ratio test

The likelihood ratio test (LRT) “unifies” frequentist statistical tests. Brand-name tests like the t-test, F-test, and chi-squared test are specific cases (or approximations) of the LRT.

  • The likelihood ratio test (LRT) begins with a comparison of the likelihood scores of the two models:

\(\text{LR} = -2(\ln L_1 - \ln L_2)\)

Where:

  • \(\ln L_1\) and \(\ln L_2\) are the log-likelihoods of the two models being compared.

Likelihood ratio test

  • This LRT statistic approximately follows a chi-square distribution.

  • To determine if the difference in likelihood scores between the two models is statistically significant, we next must consider the degrees of freedom.

  • In the LRT, degrees of freedom is equal to the number of additional parameters in the more complex model.

LR <- -2*(null_likelihood[1]-fitted_likelihood[1])

LR
[1] 129.0546
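The same arithmetic works for any nested pair of models via logLik(). A minimal sketch on a toy Gaussian glm (mtcars stands in for the janka data here):

```r
# Sketch: likelihood ratio test by hand on a toy nested pair of models
fit0 <- glm(mpg ~ 1,  data = mtcars)   # null model
fit1 <- glm(mpg ~ wt, data = mtcars)   # fitted model
LR <- -2 * (as.numeric(logLik(fit0)) - as.numeric(logLik(fit1)))
df <- attr(logLik(fit1), "df") - attr(logLik(fit0), "df")
pchisq(LR, df = df, lower.tail = FALSE)  # approximate p-value
```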

Likelihood ratio test

  • The likelihood ratio test statistic (lrtest() from the lmtest package) is the difference in deviance between the two models, i.e. minus twice the difference in their log-likelihoods. (It can also be carried out with the anova() function)

  • Scaling by -2 ensures the test statistic follows a chi-square distribution under certain conditions

  • We can calculate the likelihood ratio test statistic, compare it to the chi-square distribution with the appropriate degrees of freedom, and determine whether the improvement in fit between the two models is statistically significant.

Likelihood ratio test

Model 1: hardness ~ 1
Model 2: hardness ~ dens
  #Df  LogLik Df  Chisq Pr(>Chisq)    
1   2 -288.01                         
2   3 -223.48  1 129.05  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

F distribution

What if the variance is unknown e.g. Gaussian and Gamma distributions?

  • Then the null model estimates one parameter while the alternative model estimates two, so the difference in df is still 1.

  • In this case, we know this more familiarly as the ANOVA or F-test, which in this example is equivalent to the t-test on the dens slope coefficient.


Call:
glm(formula = hardness ~ dens, family = Gamma(link = "sqrt"), 
    data = janka)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.86729    0.90357   2.067   0.0465 *  
dens         0.76780    0.02243  34.224   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 0.009716208)

    Null deviance: 11.26788  on 35  degrees of freedom
Residual deviance:  0.32876  on 34  degrees of freedom
AIC: 452.97

Number of Fisher Scoring iterations: 4

F distribution

What if the variance is unknown e.g. Gaussian and Gamma distributions?

anova(null_model, fitted_model, test = "F")
Analysis of Deviance Table

Model 1: hardness ~ 1
Model 2: hardness ~ dens
  Resid. Df Resid. Dev Df Deviance      F    Pr(>F)    
1        35    11.2679                                 
2        34     0.3288  1   10.939 1125.9 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(fitted_model, test = "F")
Single term deletions

Model:
hardness ~ dens
       Df Deviance     AIC F value    Pr(>F)    
<none>      0.3288  452.97                      
dens    1  11.2679 1576.83  1131.3 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Challenge: Fruitfly

  • Using the fruitfly.csv dataset

  • Build a model to investigate the effect of sleep, mating status and body size on longevity

  • Check the model fits and report findings

30:00

Confidence intervals

# Assuming 'fly_gamma' is your fitted GLM model

# Extract coefficients and confidence intervals
coef_ci <- confint(fly_gamma)

# Extract estimates
estimates <- coef(fly_gamma)

# Combine estimates with confidence intervals
estimates_df <- data.frame(
  estimates = estimates,
  lower_ci = coef_ci[,1],
  upper_ci = coef_ci[,2]
)

# Display dataframe
estimates_df
                  estimates     lower_ci    upper_ci
(Intercept)     -62.3343118 -79.00681086 -44.6346129
typeInseminated   3.3775829  -2.55299124   9.0437595
typeVirgin      -14.1795176 -19.65196164  -9.1009370
thorax          150.6246221 129.55928920 170.8932705
sleep             0.0238678  -0.08213297   0.1373446

Confidence intervals

broom is a tidyverse package with helpers to extract model components and convert them into tidy tibbles:

# Alternatively, using package-specific functions
# Example with 'broom' package
library(broom)
tidy(fly_gamma, conf.int = T)
# A tibble: 5 × 7
  term            estimate std.error statistic  p.value conf.low conf.high
  <chr>              <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)     -62.3       8.55      -7.29  3.65e-11 -79.0      -44.6  
2 typeInseminated   3.38      2.97       1.14  2.58e- 1  -2.55       9.04 
3 typeVirgin      -14.2       2.70      -5.26  6.47e- 7 -19.7       -9.10 
4 thorax          151.       10.3       14.6   2.52e-28 130.       171.   
5 sleep             0.0239    0.0576     0.414 6.79e- 1  -0.0821     0.137

Extracting main effects

  • Estimates and confidence intervals are set against the intercept (the reference level)

  • Hypothesis testing of factors with more than two levels requires a likelihood ratio or F-test:

drop1(fly_gamma, test = "F")
Single term deletions

Model:
longevity ~ type + thorax + sleep
       Df Deviance     AIC  F value    Pr(>F)    
<none>      4.8907  961.62                       
type    2   8.0073 1034.12  38.2348 1.423e-13 ***
thorax  1  10.9946 1109.45 149.7658 < 2.2e-16 ***
sleep   1   4.8982  959.81   0.1848     0.668    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Estimates and post-hoc

The emmeans package in R computes estimated marginal means (EMMs) and can also provide post-hoc comparisons and contrasts for fitted models.

By default the package computes marginal means with 95% confidence intervals, averaging over the values of the other variables in the model

emmeans::emmeans(fly_gamma, 
                 specs = pairwise ~ type + thorax + sleep,
                 type = "response")
$emmeans
 type        thorax sleep emmean   SE  df lower.CL upper.CL
 Control      0.821  23.5   61.9 2.42 120     57.1     66.7
 Inseminated  0.821  23.5   65.3 1.78 120     61.7     68.8
 Virgin       0.821  23.5   47.7 1.32 120     45.1     50.3

Confidence level used: 0.95 

$contrasts
 contrast                                                                 
 Control thorax0.82096 sleep23.464 - Inseminated thorax0.82096 sleep23.464
 Control thorax0.82096 sleep23.464 - Virgin thorax0.82096 sleep23.464     
 Inseminated thorax0.82096 sleep23.464 - Virgin thorax0.82096 sleep23.464 
 estimate   SE  df t.ratio p.value
    -3.38 2.97 120  -1.136  0.4940
    14.18 2.70 120   5.257  <.0001
    17.56 2.11 120   8.326  <.0001

P value adjustment: tukey method for comparing a family of 3 estimates 

Tables

  • There are multiple packages available that provide ways to present publication ready model summaries
library(gtsummary)

tbl_regression(fly_gamma)

library(sjPlot)

tab_model(fly_gamma)

Summary table & Figures

Characteristic      Beta    95% CI¹       p-value
type
    Control         —       —             —
    Inseminated     3.4     -2.6, 9.0     0.3
    Virgin          -14     -20, -9.1     <0.001
thorax              151     130, 171      <0.001
sleep               0.02    -0.08, 0.14   0.7

¹ CI = Confidence Interval

GLMs for binary data

GLMS for binary data

A common response variable in ecological datasets is the binary variable: we observe a phenomenon X or its “absence”

  • Presence/Absence of a species

  • Presence/Absence of a disease

  • Success/Failure to observe behaviour

  • Survival/Death of organisms

Wish to determine if \(P/A \sim \text{Variable}\)

Called a logistic regression or logit model

Binary variables

  • Investigating mayfly abundance in relation to pollution gradient measured in Cumulative Criterion Units (CCU)

  • Mayflies serve as indicators of stream health due to sensitivity to metal pollution

  • Higher CCU values indicate increased metal pollution, potentially impacting mayfly presence or absence in the stream

Ephemera vulgata

Example: Mayflies

Fitting a linear regression vs a logit-link regression:

Bernoulli Distribution

The Bernoulli distribution models the probability of success or failure in a single trial of a binary experiment:

  • where success occurs with probability \(p\)

  • and failure with probability \({1-p}\)

Probability distribution

\(\text{logit}(p) = \log \frac{p}{1 - p}\)

\(y_i \sim \text{Bernoulli}(p)\)

  • Mean of the distribution: the probability \(p\) of observing the outcome

  • Variance of observed responses: \(p(1 - p)\); as the expected responses approach 0 or 1, the variance of the distribution decreases

Probability, odds, logit-odds

  • The odds put our expected values on a 0 to +Inf scale.

  • The log transformation puts our expected values on a -Inf to +Inf scale.

  • Now, the expected values can be linearly related to the linear predictor.
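Base R has both directions of this mapping built in: qlogis() is the logit and plogis() its inverse. A quick sketch:

```r
# Sketch: probability -> odds -> log-odds, and back again
p <- 0.75
odds     <- p / (1 - p)     # 3: the 0 to +Inf scale
log_odds <- log(odds)       # identical to qlogis(p): the -Inf to +Inf scale
plogis(log_odds)            # inverse logit returns 0.75
```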

Example

  • Can you recreate the binomial glm for mayfly~ccu data?
15:00
mayfly_glm <- glm(Occupancy ~ CCU, family = binomial(link = "logit"), data = Mayflies)

summary(mayfly_glm)

Call:
glm(formula = Occupancy ~ CCU, family = binomial(link = "logit"), 
    data = Mayflies)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)    5.102      2.369   2.154   0.0313 *
CCU           -3.051      1.211  -2.520   0.0117 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 34.795  on 29  degrees of freedom
Residual deviance: 12.649  on 28  degrees of freedom
AIC: 16.649

Number of Fisher Scoring iterations: 7

Interpret the model

Each coefficient corresponds to a predictor variable, indicating the change in the log odds of the response variable associated with a one-unit change in that predictor, holding other predictors constant.


  • Interpretation of coefficients involves exponentiating them to obtain odds ratios.

  • An odds ratio greater than 1 implies higher odds of the event occurring with an increase in the predictor.

  • An odds ratio less than 1 implies lower odds of the event occurring with an increase in the predictor.

Interpret the model

Probability Odds Log Odds Verbiage
\(p=.95\) \(\frac{95}{5} = \frac{19}{1} = 19\) 2.94 19 to 1 odds for
\(p=.75\) \(\frac{75}{25} = \frac{3}{1} = 3\) 1.09 3 to 1 odds for
\(p=.50\) \(\frac{50}{50} = \frac{1}{1} = 1\) 0 1 to 1 odds
\(p=.25\) \(\frac{25}{75} = \frac{1}{3} = 0.33\) -1.11 3 to 1 odds against
\(p=.05\) \(\frac{5}{95} = \frac{1}{19} = 0.0526\) -2.94 19 to 1 odds against

Logit odds

When we use a binomial model we work on the ‘log-odds’ scale. This gives a fully unconstrained linear predictor: odds below 1 map to negative values, so the scale runs from -∞ to +∞.

\[\log\left(\frac{p}{1-p}\right) = \beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}\]

\[\frac{p}{1-p} = e^{\beta_{0}}e^{\beta_{1}x_{1}}e^{\beta_{2}x_{2}}\]

  • We can interpret \(\beta_{1}\) and \(\beta_{2}\) as the increase in the log odds for every unit increase in \(x_{1}\) and \(x_{2}\).

  • We could alternatively interpret a one unit change in \(x_{1}\) as multiplying the odds by \(e^{\beta_{1}}\).

For the intercept:

  • The estimate as log odds is 5.102.

  • Therefore the odds are \(e^{5.102} \approx 164\)

  • Over 99% probability of observing mayfly when CCU = 0
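A quick sketch of those three scales, plugging in the intercept estimate from the model summary:

```r
# intercept from the mayfly model summary, on three scales
b0 <- 5.102     # log-odds

exp(b0)     # odds: ~164
plogis(b0)  # probability: ~0.994, i.e. over 99%
```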


Call:
glm(formula = Occupancy ~ CCU, family = binomial(link = "logit"), 
    data = Mayflies)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)    5.102      2.369   2.154   0.0313 *
CCU           -3.051      1.211  -2.520   0.0117 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 34.795  on 29  degrees of freedom
Residual deviance: 12.649  on 28  degrees of freedom
AIC: 16.649

Number of Fisher Scoring iterations: 7

For the predictor variable CCU:

  • The estimate is -3.051.

  • This means that for every one unit increase in CCU, the log odds of the response variable decreases by 3.051, holding all other predictors constant.


Call:
glm(formula = Occupancy ~ CCU, family = binomial(link = "logit"), 
    data = Mayflies)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)    5.102      2.369   2.154   0.0313 *
CCU           -3.051      1.211  -2.520   0.0117 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 34.795  on 29  degrees of freedom
Residual deviance: 12.649  on 28  degrees of freedom
AIC: 16.649

Number of Fisher Scoring iterations: 7

Now, let’s interpret the coefficient for CCU using the odds ratio interpretation:

  • The odds ratio associated with CCU is calculated as \(e^{\beta_{\text{CCU}}} = e^{-3.051}\).

  • Therefore, \(e^{\beta_{\text{CCU}}} \approx 0.047\).

  • This means that for every one unit increase in CCU, the odds of the response variable are multiplied by 0.047 (a decrease of about 95%)

broom::tidy(mayfly_glm, exponentiate = T)
# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept) 164.          2.37      2.15  0.0313
2 CCU           0.0473      1.21     -2.52  0.0117
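The same back-transformations can be done by hand; a sketch using the estimates reported above:

```r
# CCU slope from the summary, as an odds ratio
b1 <- -3.051
exp(b1)  # ~0.047: each extra unit of CCU multiplies the odds by ~0.047

# predicted probability of occupancy at CCU = 1 (intercept 5.102)
plogis(5.102 + b1 * 1)  # ~0.89
```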

Challenge

  • Work with the odds and probability calculators

  • Check how comfortable you are with

  • additive log-odds

  • multiplicative odds

  • changing probability

  • Can you work out the probability of mayfly occupancy at each CCU level?

10:00

Model fit

For a simple Bernoulli/Binary GLM there are only a few checks that apply:

# Calculate and plot "deviance" residuals
dresid <- resid(mayfly_glm, type = "deviance")
hist(dresid)
# Plot leverage
plot(mayfly_glm, which = 5)
# Plot Cook's distance
plot(mayfly_glm, which = 4)

Model fit

Residuals in binary GLMs will never be normally distributed, so a “residuals vs fitted” plot provides little insight. We can try “binned residuals”:

DHARMa fit

  • DHARMa works by simulating residuals from the fitted model, comparing them with observed residuals, and providing diagnostic plots to assess model adequacy and identify potential issues like heteroscedasticity or influential data points.
library(DHARMa)
plot(simulateResiduals(mayfly_glm))

Prediction

  • predict() function for GLMs computes predicted values based on the fitted model.

  • For existing data, it predicts values for each observation in the original dataset.

  • For new data, it predicts values for a specified new dataset, allowing for predictions on unseen data.

# Predictions for existing data
predict(mayfly_glm, type = "response")
# Predictions for new data
new_CCU<- data.frame(CCU = c(0,1,2,3,4,5))
predict(mayfly_glm, newdata = new_CCU, type = "response")
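The type argument controls the scale of the predictions; a self-contained sketch on simulated binary data (the variables here are invented for illustration):

```r
# link scale (log-odds) vs response scale (probability)
set.seed(42)
x <- runif(30, 0, 5)
y <- rbinom(30, 1, plogis(4 - 2 * x))
m <- glm(y ~ x, family = binomial)

eta <- predict(m, type = "link")      # log-odds
p   <- predict(m, type = "response")  # probabilities
all.equal(p, plogis(eta))  # TRUE: response scale = inverse-logit of link scale
```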

Emmeans

The emmeans package in R computes estimated marginal means (EMMs); it can also provide post-hoc comparisons and contrasts for fitted models.

  • The function call below computes EMMs for the predictor variable CCU at specific values (0 to 5) and provides them on the response (probability) scale.

emmeans::emmeans(mayfly_glm, specs = ~ CCU, at = list(CCU = c(0:5)), type = "response")
CCU prob SE df asymp.LCL asymp.UCL
0 0.99 0.014 Inf 0.61 1.0
1.0 0.89 0.13 Inf 0.39 0.99
2.0 0.27 0.15 Inf 0.078 0.62
3.0 0.017 0.026 Inf 0.00082 0.27
4.0 0.00082 0.0022 Inf 0.0000042 0.14
5.0 0.000039 0.00015 Inf 0.000000020 0.071

Prediction plots with emmeans

We can use the emmeans package to compute estimated marginal means (EMMs) from a fitted GLM, convert them to a tibble for visualisation with ggplot2, and overlay the predicted means with uncertainty ribbons on a scatter plot of the original data.

means <- emmeans::emmeans(mayfly_glm, 
                 specs = ~ CCU, 
                 at=list(CCU=c(0:5)), 
                 type='response') |> 
  as_tibble()

ggplot(Mayflies, aes(x=CCU, y=Occupancy)) + geom_point()+
    geom_ribbon(data = means,
              aes(x = CCU,
                  y = prob,
                  ymin = asymp.LCL,
                  ymax = asymp.UCL),
              alpha = .2)+
  geom_line(data = means,
            aes(x = CCU,
                y = prob)) +  # regression line
  labs(title = "Logit-link Binomial Regression",
       x = "CCU", y = "Occupancy")+
  theme_classic(base_size = 14)

Practice - Bacteria

10:00



Using the ‘bacteria’ dataset, model the presence of H. influenzae as a function of treatment and week of test.

Start with a full model between these two variables (include an interaction term) and use the likelihood ratio test to see if this should be retained

Rows: 220
Columns: 6
$ y    <fct> y, y, y, y, y, y, n, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y,…
$ ap   <fct> p, p, p, p, a, a, a, a, a, a, a, a, a, p, p, p, p, p, p, p, p, p,…
$ hilo <fct> hi, hi, hi, hi, hi, hi, hi, hi, lo, lo, lo, lo, lo, lo, lo, lo, l…
$ week <int> 0, 2, 4, 11, 0, 2, 6, 11, 0, 2, 4, 6, 11, 0, 2, 4, 6, 11, 0, 2, 4…
$ ID   <fct> X01, X01, X01, X01, X02, X02, X02, X02, X03, X03, X03, X03, X03, …
$ trt  <fct> placebo, placebo, placebo, placebo, drug+, drug+, drug+, drug+, d…

Solution - Bacteria

  • Fit the model and check whether the residuals are OK

  • Full model:

Single term deletions

Model:
y ~ trt * week
         Df Deviance    AIC     LRT Pr(>Chi)
<none>        203.12 215.12                 
trt:week  2   203.81 211.81 0.68544   0.7098

Solution - Bacteria

We found no significant difference in deviance when the interaction term was dropped from the full model (\(\chi^2_2\) = 0.69, p = 0.710), so it was removed and estimates were calculated from an additive model containing treatment and week

Single term deletions

Model:
y ~ trt + week
       Df Deviance    AIC    LRT Pr(>Chi)   
<none>      203.81 211.81                   
trt     2   210.91 214.91 7.1026 0.028688 * 
week    1   210.72 216.72 6.9140 0.008552 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Call:
glm(formula = y ~ trt + week, family = binomial, data = bacteria)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.54629    0.40555   6.279 3.42e-10 ***
trtdrug     -1.10667    0.42519  -2.603  0.00925 ** 
trtdrug+    -0.65166    0.44615  -1.461  0.14412    
week        -0.11577    0.04414  -2.623  0.00872 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 217.38  on 219  degrees of freedom
Residual deviance: 203.81  on 216  degrees of freedom
AIC: 211.81

Number of Fisher Scoring iterations: 4

Challenge

40:00

Seychelles warbler



  • Using the malaria dataset fit a binomial glm to look at the predictors of malaria in the Seychelles warbler

  • Check the model fit

  • Make predictions

  • Produce a suitable figure

GLMs for Binomial data

Proportion data and GLM

Sometimes count data aligns more closely with logistic regression than it initially appears.

When we measure the number of occurrences out of a known total sample size, we are no longer looking at typical (unbounded) count data.

Imagine, for example, we’re studying the prevalence of a native underbrush species across various forest plots. We assess 15 different plots, each meticulously surveyed for the presence of this underbrush, counting the number of square meters where the underbrush is present versus absent within a standard area:

\[\frac{m^2~\text{with native underbrush (successes)}}{\text{total surveyed area in}~m^2~\text{(trials)}}\]

Bound between zero and one

Binomial distribution

\(logit(p) = \log \frac{p}{1 - p}\)

\(y_i = Binomial(n,p)\)

  • It is the distribution of the number of successes across a collection of Bernoulli trials of the same event

  • It is parameterised by the number of Bernoulli trials \(n\)

  • and by the probability of the event in each trial \(p\)

Binomial: Example

Tribolium

This small dataset is from an experiment looking at mortality in batches of flour exposed to different doses of pesticides.

The question of interest is:

How does pesticide dose affect mortality in the flour beetle Tribolium castaneum

The data

# A tibble: 8 × 4
   Dose Number_tested Number_killed Mortality_rate
  <dbl>         <dbl>         <dbl>          <dbl>
1  49.1            49             6          0.122
2  53.0            60            13          0.217
3  56.9            62            18          0.290
4  60.8            56            28          0.5  
5  64.8            63            52          0.825
6  68.7            59            53          0.898
7  72.6            62            61          0.984
8  76.5            60            60          1    

Preparing the data

As well as the numbers killed, we need the numbers that remain alive

beetles <- beetles |> 
  rename("dead" = Number_killed,
         "trials" = Number_tested) |> 
  mutate("alive" = trials-dead)

beetles
# A tibble: 8 × 5
   Dose trials  dead Mortality_rate alive
  <dbl>  <dbl> <dbl>          <dbl> <dbl>
1  49.1     49     6          0.122    43
2  53.0     60    13          0.217    47
3  56.9     62    18          0.290    44
4  60.8     56    28          0.5      28
5  64.8     63    52          0.825    11
6  68.7     59    53          0.898     6
7  72.6     62    61          0.984     1
8  76.5     60    60          1         0

Binomial VS Bernoulli Keypoints!

Bernoulli describes the outcome of a single trial of an event, whereas Binomial describes the number of successes across multiple trials of the same event.


Use Bernoulli when each observation is a single yes/no outcome,


and Binomial when each observation is a count of successes out of a known number of trials.
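A small simulation illustrates the relationship (the numbers here are arbitrary):

```r
# a Binomial(n, p) count is just the sum of n Bernoulli(p) trials
set.seed(1)
n <- 10; p <- 0.3

bern_sums <- replicate(10000, sum(rbinom(n, 1, p)))  # sums of Bernoulli trials
binom     <- rbinom(10000, n, p)                     # direct Binomial draws

c(mean(bern_sums), mean(binom))  # both close to n * p = 3
```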

Binomial glm

The GLM of the binomial counts will analyse the number of beetles killed in each batch, while taking into account the size of each group

#cbind() creates a matrix of successes and failures for each batch

beetle_glm <- glm(cbind(dead, alive) ~ Dose, family = binomial(link = "logit"), data = beetles)

summary(beetle_glm)

Call:
glm(formula = cbind(dead, alive) ~ Dose, family = binomial(link = "logit"), 
    data = beetles)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -14.57806    1.29846  -11.23   <2e-16 ***
Dose          0.24554    0.02149   11.42   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 267.6624  on 7  degrees of freedom
Residual deviance:   8.4379  on 6  degrees of freedom
AIC: 38.613

Number of Fisher Scoring iterations: 4

Weights

An alternative method is to analyse the proportion of beetles killed, supplying the number of trials with the weights argument

beetle_glm_weights <- glm(Mortality_rate ~ Dose, weights = trials, family = binomial(link = "logit"), data = beetles)

summary(beetle_glm_weights)

Call:
glm(formula = Mortality_rate ~ Dose, family = binomial(link = "logit"), 
    data = beetles, weights = trials)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -14.57806    1.29846  -11.23   <2e-16 ***
Dose          0.24554    0.02149   11.42   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 267.6624  on 7  degrees of freedom
Residual deviance:   8.4379  on 6  degrees of freedom
AIC: 38.613

Number of Fisher Scoring iterations: 4
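The two specifications are equivalent; a toy check (the doses and counts below are invented):

```r
# cbind(successes, failures) vs proportion + weights give identical fits
trials <- c(20, 20, 20, 20)
dead   <- c(2, 6, 12, 18)
dose   <- 1:4

m1 <- glm(cbind(dead, trials - dead) ~ dose, family = binomial)
m2 <- glm(dead / trials ~ dose, weights = trials, family = binomial)

all.equal(coef(m1), coef(m2))  # TRUE
```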

Checking Binomial model fit

  • For a Binomial model standard residual checks do not apply
# Calculate and plot "deviance" residuals

par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))

pred <- predict(beetle_glm_weights) 
dresid <- resid(beetle_glm_weights, type = "deviance")

# Check for overdispersion
hist(dresid)

# Binned residuals
arm::binnedplot(dresid, pred)

# Plot leverage
plot(beetle_glm_weights, which = 5)

plot(beetle_glm_weights, which = 2)

Performance residual checks

performance::check_model(beetle_glm_weights)

performance::binned_residuals(beetle_glm_weights)
Warning: Probably bad model fit. Only about 67% of the residuals are inside the error bounds.

Checking with DHARMa

  • DHARMa works by simulating residuals from the fitted model, comparing them with observed residuals, and providing diagnostic plots to assess model adequacy and identify potential issues like heteroscedasticity or influential data points.
library(DHARMa)
plot(simulateResiduals(beetle_glm_weights))

Estimates of significance

Confidence intervals

library(broom)
tidy(beetle_glm, conf.int = T)
# A tibble: 2 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)  -14.6      1.30       -11.2 3.00e-29  -17.3     -12.2  
2 Dose           0.246    0.0215      11.4 3.18e-30    0.206     0.290
drop1(beetle_glm, test = "Chi")
Single term deletions

Model:
cbind(dead, alive) ~ Dose
       Df Deviance     AIC    LRT  Pr(>Chi)    
<none>       8.438  38.613                     
Dose    1  267.662 295.837 259.23 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Table

This function automatically exponentiates the coefficients to produce odds (for the intercept) and odds ratios:

library(sjPlot)

tab_model(beetle_glm)
  cbind(dead, alive)
Predictors Odds Ratios CI p
(Intercept) 0.00 0.00 – 0.00 <0.001
Dose 1.28 1.23 – 1.34 <0.001
Observations 8

Predictions

emmeans::emmeans(beetle_glm, 
                 specs = ~ Dose, 
                 at=list(Dose=c(40, 50, 60, 70, 80)), 
                 type='response') 
 Dose    prob      SE  df asymp.LCL asymp.UCL
   40 0.00852 0.00382 Inf   0.00354    0.0204
   50 0.09103 0.02099 Inf   0.05742    0.1414
   60 0.53851 0.03260 Inf   0.47432    0.6014
   70 0.93149 0.01595 Inf   0.89282    0.9569
   80 0.99373 0.00279 Inf   0.98506    0.9974

Confidence level used: 0.95 
Intervals are back-transformed from the logit scale 

Predictions plot

Overdispersion

If the residual deviance of the model is greater than the residual degrees of freedom, we consider our binomial model to be overdispersed.

This means the variance is greater than predicted by the binomial distribution

An overdispersion statistic > 1.5 can be accounted for with a quasi-binomial GLM
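The dispersion statistic is simple to compute from any fitted model; a sketch with invented data:

```r
# overdispersion statistic = residual deviance / residual df
trials <- c(20, 20, 20, 20, 20)
dead   <- c(3, 6, 9, 12, 18)
dose   <- 1:5
m <- glm(cbind(dead, trials - dead) ~ dose, family = binomial)

phi <- deviance(m) / df.residual(m)
phi  # values well above 1 suggest switching to quasibinomial
```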


Call:
glm(formula = Mortality_rate ~ Dose, family = binomial(link = "logit"), 
    data = beetles, weights = trials)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -14.57806    1.29846  -11.23   <2e-16 ***
Dose          0.24554    0.02149   11.42   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 267.6624  on 7  degrees of freedom
Residual deviance:   8.4379  on 6  degrees of freedom
AIC: 38.613

Number of Fisher Scoring iterations: 4

Overdispersion

Binomial \(P(X = s) = C(n, s) \cdot p^s \cdot (1 - p)^{n - s}\)

Quasibinomial \(P(X = s) = C(n, s) \cdot p(p+s\theta)^{s-1} \cdot (1 - p - s\theta)^{n - s}\)

Where:

  • \(n\) is the total number of trials of an event.

  • \(s\) corresponds to the number of times an event should occur.

  • \(p\) is the probability that the event will occur.

  • \((1 - p)\) is the probability that the event will not occur.

  • \(C\) term represents combinations, calculated as \(C(n, s) = \frac{n!}{s!(n - s)!}\), representing the number of ways to choose \(s\) successes out of \(n\) trials.

  • \(\theta\) term describes additional variance outside of the Binomial distribution

Fitting the quasibinomial model


Call:
glm(formula = cbind(dead, alive) ~ Dose, family = binomial(link = "logit"), 
    data = beetles)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -14.57806    1.29846  -11.23   <2e-16 ***
Dose          0.24554    0.02149   11.42   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 267.6624  on 7  degrees of freedom
Residual deviance:   8.4379  on 6  degrees of freedom
AIC: 38.613

Number of Fisher Scoring iterations: 4

Call:
glm(formula = cbind(dead, alive) ~ Dose, family = quasibinomial(link = "logit"), 
    data = beetles)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -14.57806    1.46611  -9.943 5.98e-05 ***
Dose          0.24554    0.02427  10.118 5.42e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 1.274895)

    Null deviance: 267.6624  on 7  degrees of freedom
Residual deviance:   8.4379  on 6  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 4

Fitting the quasibinomial model

Standard errors, confidence intervals and p-values WILL change.

Estimates remain the same

Likelihood ratio tests should now be applied with an F-statistic

drop1(beetle_quasiglm, test = "F")
Single term deletions

Model:
cbind(dead, alive) ~ Dose
       Df Deviance F value    Pr(>F)    
<none>       8.438                      
Dose    1  267.662  184.33 9.908e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Challenge: Binomial model

On the afternoon of January 28th, 1986, the space shuttle Challenger experienced a critical failure and broke apart in mid-air, resulting in the deaths of all seven crew members.

An investigation discovered critical failures in all six of the O-rings in the solid rocket booster.

How did temperature affect the probability of o-ring failure?

What was the probability of all o-rings failing when the temperature was 36 degrees F?

40:00



Challenger

GLMs for count data

Poisson

Count or rate data are ubiquitous in the life sciences (e.g. the number of parasites per microlitre of blood, or the number of species counted in a particular area). These types of data are discrete and non-negative.

A useful distribution to model abundance data is the Poisson distribution:

a discrete distribution with a single parameter, λ (lambda), which defines both the mean and the variance of the distribution.

GLM

Recall the simple linear regression case (i.e. a GLM with a Gaussian error structure and identity link). For the sake of clarity let’s write out the mean \(\mu\) for Y:

\(\mu_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \dots + \beta_nx_{ni}\)

\(y_i \sim \mathcal{N}(\mu_i, \sigma^2)\)

  • The mean function is unconstrained, i.e. the value of \(\beta_0 + \beta_1X\) can range from \(-\infty\) to \(+\infty\).

Poisson

  • If we want to model count data we therefore want to constrain this mean to be positive only.
  • Mathematically we can do this by taking the logarithm of the mean (the log is the default link for the Poisson distribution).

  • We then assume our counts to be Poisson distributed (a discrete, non-negative distribution), to obtain our Poisson regression model (to be consistent with the statistics literature we will rename \(\mu\) to \(\lambda\)):
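The log link is what enforces positivity; a one-line illustration:

```r
# any linear-predictor value maps to a strictly positive mean count
eta <- c(-5, 0, 5)   # unconstrained linear predictor
lambda <- exp(eta)   # inverse of the log link
lambda               # all > 0
```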

GLM Poisson

Poisson equation

Poisson limitations

The Poisson distribution has the following characteristics:

  • Discrete variable (integer), defined on the range \(0, 1, \dots, \infty\).

  • A single rate parameter \(\lambda\), where \(\lambda > 0\).

  • Mean = \(\lambda\)

  • Variance = \(\lambda\)
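A quick simulation confirms the mean-equals-variance property:

```r
# for a Poisson sample, the mean and variance should both be ~ lambda
set.seed(1)
x <- rpois(1e5, lambda = 4)
c(mean(x), var(x))  # both close to 4
```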

Poisson: Case study

Cuckoo

In a study by Kilner et al. (1999), the authors studied the begging rate of nestlings in relation to total mass of the brood of reed warbler chicks and cuckoo chicks. The question of interest is:

How does nestling mass affect begging rates between the different species?

This model is inadequate. It is predicting negative begging calls within the range of the observed data, which clearly does not make any sense.

cuckoo_lm <- lm(Beg ~ Mass + Species + Mass:Species, data = cuckoo)

Diagnostics

Let us display the model diagnostics plots for this linear model.

  • Curvature

  • Funnelling effect

Biological data and distributions

We can look at the kurtosis and skewness of our univariate variable - Begging:

descdist(cuckoo$Beg, boot = 500, discrete = T)

summary statistics
------
min:  0   max:  55 
median:  9 
mean:  16.13725 
estimated sd:  17.03176 
estimated skewness:  0.8270038 
estimated kurtosis:  2.367674 

Univariate distributions

par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
fp <- fitdist(cuckoo$Beg, "pois")
fn <- fitdist(cuckoo$Beg, "norm")
fnb<- fitdist(cuckoo$Beg, "nbinom")
plot.legend <- c("normal", "poisson", "negative binomial")
denscomp(list(fn, fp, fnb), legendtext = plot.legend)
qqcomp(list(fn, fp, fnb), legendtext = plot.legend)

Poisson model

We should therefore try a different model structure.

The response variable in this case is a classic count data: discrete and bounded below by zero (i.e we cannot have negative counts). We will therefore try a Poisson model using the canonical log link function for the mean:

  • λ varies with x (mass) which means residual variance will also vary with x, which means that we just relaxed the homogeneity of variance assumption!

  • Predicted mean counts will always be positive, though not necessarily integers

  • The model will never predict negative values (Poisson is strictly positive)

\[ \log{\lambda} = \beta_0 + \beta_1 M_i + \beta_2 S_i + \beta_3 M_i S_i \]

Poisson model

\[ \log{\lambda} = \beta_0 + \beta_1 M_i + \beta_2 S_i + \beta_3 M_i S_i \]

where \(M_i\) is nestling mass and \(S_i\) a dummy variable

\(S_i = \left\{\begin{array}{ll}1 & \text{if } i \text{ is warbler},\\0 & \text{otherwise}.\end{array}\right.\)

The term \(M_iS_i\) is an interaction term. Think of this as an additional explanatory variable in our model. Effectively it lets us have different slopes for different species (without an interaction term we assume that both species have the same slope for the relationship between begging rate and mass, and only the intercepts differ).

Regression lines

The mean regression lines for the two species look like this:

  • Cuckoo (\(S_i=0\))

  • \(\begin{aligned}\log{\lambda} & = \beta_0 + \beta_1 M_i + (\beta_2 \times 0) + (\beta_3 \times M_i \times 0)\\\log{\lambda} & = \beta_0 + \beta_1 M_i\end{aligned}\)

  • Warbler (\(S_i=1\))

  • \(\begin{aligned}\log{\lambda} & = \beta_0 + \beta_1 M_i + (\beta_2 \times 1) + (\beta_3 \times M_i \times 1)\\\log{\lambda} & = \beta_0 + \beta_1 M_i + \beta_2 + \beta_3M_i\\\end{aligned}\)

Fit the Poisson model

Fit the model with the interaction term in R:

cuckoo_glm1 <- glm(Beg ~ Mass + Species + Mass:Species, data=cuckoo, family=poisson(link="log"))

summary(cuckoo_glm1)

Call:
glm(formula = Beg ~ Mass + Species + Mass:Species, family = poisson(link = "log"), 
    data = cuckoo)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)          0.334475   0.227143   1.473  0.14088    
Mass                 0.094847   0.007261  13.062  < 2e-16 ***
SpeciesWarbler       0.674820   0.259217   2.603  0.00923 ** 
Mass:SpeciesWarbler -0.021673   0.008389  -2.584  0.00978 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 970.08  on 50  degrees of freedom
Residual deviance: 436.05  on 47  degrees of freedom
AIC: 615.83

Number of Fisher Scoring iterations: 6

Note the negative interaction effect for Mass:Species, indicating that begging calls increase more slowly with mass in warblers than in cuckoos.
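On the log scale, the species-specific slopes can be read straight off the coefficient table above:

```r
# slopes of log(begging rate) on mass, from summary(cuckoo_glm1)
slope_cuckoo  <- 0.094847                 # Mass
slope_warbler <- 0.094847 + (-0.021673)   # Mass + Mass:SpeciesWarbler

c(slope_cuckoo, slope_warbler)  # warblers have the shallower slope
```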

Exercise

  • Fit a Poisson model to the cuckoo data

  • Look at the residual plots - how have they changed?

15:00

Model summary

summary(cuckoo_glm1)

Call:
glm(formula = Beg ~ Mass + Species + Mass:Species, family = poisson(link = "log"), 
    data = cuckoo)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)          0.334475   0.227143   1.473  0.14088    
Mass                 0.094847   0.007261  13.062  < 2e-16 ***
SpeciesWarbler       0.674820   0.259217   2.603  0.00923 ** 
Mass:SpeciesWarbler -0.021673   0.008389  -2.584  0.00978 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 970.08  on 50  degrees of freedom
Residual deviance: 436.05  on 47  degrees of freedom
AIC: 615.83

Number of Fisher Scoring iterations: 6

Poisson model check

VIF

VIF, or Variance Inflation Factor, measures the extent of multicollinearity among independent variables in regression analysis.

High multicollinearity, indicated by elevated VIF values, can distort regression coefficients and inflate standard errors, undermining the reliability of the model’s predictions

car::vif(cuckoo_glm1)
        Mass      Species Mass:Species 
    4.050091    12.967267    14.496767 
check_model(cuckoo_glm1, check = "vif")

Mean-centering

“True” multicollinearity between predictors is highly problematic

  • Apparent VIF created by an interaction term is structural: mean-centering the continuous predictor (so its mean becomes zero) removes it

  • Centering often substantially reduces this correlation

cuckoo$mass.c <- cuckoo$Mass - mean(cuckoo$Mass, na.rm =T)

cuckoo_glm2 <- glm(Beg ~ mass.c + Species + mass.c:Species, data=cuckoo, family=poisson(link="log"))

summary(cuckoo_glm2)

Call:
glm(formula = Beg ~ mass.c + Species + mass.c:Species, family = poisson(link = "log"), 
    data = cuckoo)

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)            2.263284   0.091946  24.615  < 2e-16 ***
mass.c                 0.094847   0.007261  13.062  < 2e-16 ***
SpeciesWarbler         0.234068   0.106710   2.194  0.02827 *  
mass.c:SpeciesWarbler -0.021673   0.008389  -2.584  0.00978 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 970.08  on 50  degrees of freedom
Residual deviance: 436.05  on 47  degrees of freedom
AIC: 615.83

Number of Fisher Scoring iterations: 6

Exercise



  • cuckoo$mass.c <- cuckoo$Mass - mean(cuckoo$Mass, na.rm =T)

  • Mean center the mass variable and recheck VIF. car::vif()

10:00

Model selection

For a term with only one degree of freedom \(z^2\) = the chi-square value

The likelihood ratio test allows us to determine whether we can remove an interaction term

# For a fixed  mean-variance model we use a Chisquare distribution
drop1(cuckoo_glm2, test = "Chisq")
Single term deletions

Model:
Beg ~ mass.c + Species + mass.c:Species
               Df Deviance    AIC   LRT Pr(>Chi)   
<none>              436.05 615.83                  
mass.c:Species  1   442.81 620.59 6.766 0.009291 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Offset - for rates

In Poisson GLMs, using an offset allows us to model rates rather than counts, by including the logarithm of the exposure variable as an offset, ensuring that the model accounts for the differing exposure levels across observations.

cuckoo_glm3 <- glm(Call_Total ~ mass.c + Species + mass.c:Species, data=cuckoo, offset = log(Mins), family=poisson(link="log"))

summary(cuckoo_glm3)

Call:
glm(formula = Call_Total ~ mass.c + Species + mass.c:Species, 
    family = poisson(link = "log"), data = cuckoo, offset = log(Mins))

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)            2.249441   0.030831  72.961   <2e-16 ***
mass.c                 0.129261   0.002502  51.671   <2e-16 ***
SpeciesWarbler         0.300202   0.035330   8.497   <2e-16 ***
mass.c:SpeciesWarbler -0.057893   0.002829 -20.462   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 9597.9  on 50  degrees of freedom
Residual deviance: 3071.8  on 47  degrees of freedom
AIC: 3334.6

Number of Fisher Scoring iterations: 6
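A toy simulation (all values invented) shows why the log of the exposure enters as an offset with a fixed coefficient of 1:

```r
# counts observed over differing durations: model the rate via an offset
set.seed(2)
mins  <- runif(40, 5, 30)                         # observation time per nest
x     <- runif(40)
calls <- rpois(40, lambda = mins * exp(0.5 + x))  # rate = exp(0.5 + x) per min

m <- glm(calls ~ x, offset = log(mins), family = poisson)
coef(m)  # estimates close to the true values (0.5, 1)
```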

Poisson limitations

So for the Poisson regression case we assume that the mean and variance are equal. Hence, as the mean increases, the variance increases also (heteroscedasticity). This may or may not be a sensible assumption, so watch out! Just because the Poisson distribution usually fits count data well doesn’t mean it always will.

Recall the link function between the predictors and the mean and the rules of logarithms (if \(\log{\lambda} = k\)(value of predictors), then \(\lambda = e^k\)):

Overdispersion

When the residual deviance is higher than the residual degrees of freedom, we say that the model is overdispersed. This situation is mathematically represented as:

\[ \phi = \frac{\text{Residual deviance}}{\text{Residual degrees of freedom}} \]

Overdispersion occurs when the variance in the data is even higher than the mean, indicating that the Poisson distribution might not be the best choice. This can be due to many reasons, such as the presence of many zeros, many very high values, or missing covariates in the model.

Overdispersion and what to do about it

Overdispersion statistic values > 1

Causes of over-dispersion, and what to do about them:

  • Model mis-specification (missing covariates or interactions) → add more covariates or interaction terms

  • Too many zeros (“zero inflation”) → use a zero-inflated model

  • Non-independence of observations → use a generalised mixed model with random effects to take non-independence into account

  • Variance larger than the mean → use a quasipoisson GLM if overdispersion = 2-15, or a negative binomial GLM if > 15-20

Choosing a model

Poisson/NegBin

Overdispersion and what to do about it

It may be true that you are initially unsure whether overdispersion is coming from zero-inflation, larger variance than the mean or both.

You can fit models with the same dependent and independent variables and different error structures and use AIC to help determine the best model for your data

Overdispersion in the cuckoo model

\(\phi = \frac{\text{Residual deviance}}{\text{Residual degrees of freedom}}\) = \(\frac{436.05}{47} \approx 9.3\)

summary(cuckoo_glm2)

Call:
glm(formula = Beg ~ mass.c + Species + mass.c:Species, family = poisson(link = "log"), 
    data = cuckoo)

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)            2.263284   0.091946  24.615  < 2e-16 ***
mass.c                 0.094847   0.007261  13.062  < 2e-16 ***
SpeciesWarbler         0.234068   0.106710   2.194  0.02827 *  
mass.c:SpeciesWarbler -0.021673   0.008389  -2.584  0.00978 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 970.08  on 50  degrees of freedom
Residual deviance: 436.05  on 47  degrees of freedom
AIC: 615.83

Number of Fisher Scoring iterations: 6

Quasi-Poisson GLM

The systematic part and the link function remain the same

\[ \log{\lambda} = \beta_0 + \beta_1 x_i \]

\(\phi\) is a dispersion parameter: it will not affect the coefficient estimates, but it will affect their significance, because standard errors are multiplied by \(\sqrt\phi\)

\[ \text{Var}(Y_i) = \phi \lambda_i \]

Where:

  • \(Y_i\) is the response variable.
  • \(\phi\) is the dispersion parameter, which inflates the variance of the distribution

Confidence intervals and p-values WILL change.
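A minimal sketch (using simulated data, not the cuckoo dataset) showing that quasi-Poisson leaves the point estimates unchanged while inflating every standard error by \(\sqrt\phi\):

```r
# Simulate overdispersed counts, then fit the same model both ways.
set.seed(42)
x <- runif(100)
y <- rnbinom(100, mu = exp(1 + 2 * x), size = 1)

m_pois  <- glm(y ~ x, family = poisson)
m_quasi <- glm(y ~ x, family = quasipoisson)

phi <- summary(m_quasi)$dispersion

# Identical point estimates:
coef(m_pois) - coef(m_quasi)

# Quasi-Poisson SEs are the Poisson SEs multiplied by sqrt(phi):
coef(summary(m_quasi))[, "Std. Error"] /
  coef(summary(m_pois))[, "Std. Error"]   # both ratios equal sqrt(phi)
```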

Exercise

10:00
  • Fit a quasipoisson model to the cuckoo data
cuckoo_glmq <- glm(Beg ~ mass.c + Species + mass.c:Species, data=cuckoo, family=quasipoisson(link="log"))
  • Look at the residual plots - how have they changed?

  • Look at the SEs and p-values - how have they changed?

  • Calculate new 95% confidence intervals

Compare models

summary(cuckoo_glm2)
summary(cuckoo_glmq)

Call:
glm(formula = Beg ~ mass.c + Species + mass.c:Species, family = poisson(link = "log"), 
    data = cuckoo)

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)            2.263284   0.091946  24.615  < 2e-16 ***
mass.c                 0.094847   0.007261  13.062  < 2e-16 ***
SpeciesWarbler         0.234068   0.106710   2.194  0.02827 *  
mass.c:SpeciesWarbler -0.021673   0.008389  -2.584  0.00978 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 970.08  on 50  degrees of freedom
Residual deviance: 436.05  on 47  degrees of freedom
AIC: 615.83

Number of Fisher Scoring iterations: 6

Call:
glm(formula = Beg ~ mass.c + Species + mass.c:Species, family = quasipoisson(link = "log"), 
    data = cuckoo)

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            2.26328    0.25554   8.857 1.38e-11 ***
mass.c                 0.09485    0.02018   4.700 2.30e-05 ***
SpeciesWarbler         0.23407    0.29657   0.789    0.434    
mass.c:SpeciesWarbler -0.02167    0.02332  -0.930    0.357    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 7.7242)

    Null deviance: 970.08  on 50  degrees of freedom
Residual deviance: 436.05  on 47  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 6

F distributions

The presence of overdispersion suggests using an F-test to compare nested models. We will test whether the interaction term can be dropped from the model.

Note that for a term with only one degree of freedom, the square of the t statistic is (approximately) equal to the F statistic.

drop1(cuckoo_glmq, test = "F")
Single term deletions

Model:
Beg ~ mass.c + Species + mass.c:Species
               Df Deviance F value Pr(>F)
<none>              436.05               
mass.c:Species  1   442.81  0.7293 0.3974

Challenge: Monarch butterflies

This dataset contains standardised counts of Monarch butterflies across three different gardens.

Is there a difference in observance of the butterflies between these three sites?

Check whether the Poisson model is a good fit for these data

Get the monarchs.csv dataset

15:00



Monarch butterfly

Negative Binomial

Negative Binomial

Negative Binomial

Negative binomial GLMs are favored when overdispersion is high.

It has two parameters, \(\mu\) and \(\theta\). \(\theta\) controls the dispersion (smaller \(\theta\) indicates higher dispersion). The negative binomial corresponds to a combination of two distributions (Poisson and gamma).

It assumes that the \(Y_i\) are Poisson distributed, with the mean \(\mu_i\) itself following a gamma distribution:

\[ E(Y_i) = \mu_i \\ \text{Var}(Y_i) = \mu_i + \mu_i^2/\theta \]
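The mean-variance relationship above can be checked by simulation (a sketch; in R's rnbinom(), \(\theta\) is the size argument):

```r
# Simulate negative binomial counts and compare the sample variance
# to the theoretical value mu + mu^2/theta.
set.seed(1)
mu    <- 5
theta <- 2
y <- rnbinom(1e5, mu = mu, size = theta)

mean(y)  # close to mu = 5
var(y)   # close to mu + mu^2/theta = 5 + 25/2 = 17.5
```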

Fitting a negative binomial in R

The negative binomial family is not available through the glm() function. We need glm.nb() from the MASS package.

It is suitable for strongly over-dispersed, zero-bounded integer data.

library(MASS)

model <- glm.nb(y ~ x1 + x2, link = "log", data = dataframe)

Negbin: Example

10:00

Stickleback


Open the parasite dataset and script

Determine whether the Poisson model is suitable for this dataset



The example is an experiment measuring the effect of the parasitic tapeworm Schistocephalus solidus infection on the susceptibility of infection from a second parasite, the trematode Diplostomum pseudospathaceum

The question of interest is:

How does Treatment/Infection status affect trematode counts?

Univariate distribution

# Compare candidate distributions with the fitdistrplus package
library(fitdistrplus)

par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
fp  <- fitdist(parasite$Diplo_intensity, "pois")
fn  <- fitdist(parasite$Diplo_intensity, "norm")
fnb <- fitdist(parasite$Diplo_intensity, "nbinom")
plot.legend <- c("normal", "poisson", "negative binomial")
denscomp(list(fn, fp, fnb), legendtext = plot.legend)
qqcomp(list(fn, fp, fnb), legendtext = plot.legend)

Check Poisson model

stick_poisson <- glm(Diplo_intensity ~ Treatment, family = "poisson", data = parasite)
summary(stick_poisson)

Call:
glm(formula = Diplo_intensity ~ Treatment, family = "poisson", 
    data = parasite)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)           1.82141    0.04740  38.423  < 2e-16 ***
TreatmentInfected HG  0.53969    0.05879   9.180  < 2e-16 ***
TreatmentInfected LG -0.19726    0.09771  -2.019 0.043492 *  
TreatmentUninfected  -0.31733    0.08543  -3.715 0.000203 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 887.73  on 220  degrees of freedom
Residual deviance: 697.25  on 217  degrees of freedom
AIC: 1481.5

Number of Fisher Scoring iterations: 5

Check zero-inflation

The model is underfitting zeros (but not by much).

performance::check_zeroinflation(stick_poisson)
# Check for zero-inflation

   Observed zeros: 6
  Predicted zeros: 1
            Ratio: 0.17
100*sum(parasite$Diplo_intensity == 0)/nrow(parasite)
[1] 2.714932
# Only ~3% of our data are zeros

Compare models

# Quasi-likelihood
stick_quasi <- glm(Diplo_intensity ~ Treatment, family = quasipoisson, data = parasite)
# Negative binomial (MASS::glm.nb)
stick_nb <- glm.nb(Diplo_intensity ~ Treatment, link = "log", data = parasite)
# Zero-inflated model (pscl::zeroinfl)
stick_zip <- zeroinfl(Diplo_intensity ~ Treatment | Treatment,
                      dist = "poisson",
                      link = "logit",
                      data = parasite)

Compare models

AIC(stick_poisson)
[1] 1481.456
AIC(stick_quasi)
[1] NA
AIC(stick_nb)
[1] 1259.893
AIC(stick_zip)
[1] 1466.362

Validating Negative Binomial models

library(DHARMa)
plot(simulateResiduals(stick_nb))
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))

plot(stick_nb)

Comparison of model predictions

  • Luckily, in this instance each of these models makes very similar predictions:

Zero-inflation

Zero-inflation

“Zero-inflation” refers to the problem of “too many zeros”.

The dependent variable contains more zeros than would be expected under the standard statistical distribution for zero-bounded count data (Poisson or negative binomial).

Zero-inflation is a common feature of count datasets. Large numbers of zeros can arise because

  1. There was nothing there to count (true zeros)

  2. Things were there but missed (false zeros)
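The two processes can be mimicked with a quick simulation (a sketch, not course data): structural (true) zeros are added on top of an ordinary Poisson count process, producing far more zeros than the Poisson alone would predict:

```r
# 30% structural zeros ("nothing there to count") on top of Poisson(4) counts.
set.seed(7)
n <- 1e4
structural_zero <- rbinom(n, 1, prob = 0.3)
y <- ifelse(structural_zero == 1, 0, rpois(n, lambda = 4))

mean(y == 0)   # observed proportion of zeros, roughly 0.31
dpois(0, 4)    # expected under a pure Poisson(4), roughly 0.018
```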

Zero-inflation

  • Zero-inflated Poisson (ZIP): overdispersion from zeros

  • Negative binomial (NegBin): overdispersion from positive integers

  • Zero-inflated negative binomial (ZINB): overdispersion from zeros + positive integers

Zip

ZIP in R

In a ZIP model we define two halves of the model; the first half models the count process (i.e. a Poisson GLM explaining non-zero integers) and the second half models the binomial (i.e. explains the zeros).

install.packages("pscl")
library(pscl)
zip1 <- zeroinfl(y ~ x1 + x2 | x1 + x2,
                dist = "poisson",
                link = "logit",
                data = dataframe)

Alternative model design

You can change which variables the zero-inflation half of the model uses, and use AIC to decide on the best model

# zeros modelled with a constant (intercept only)
y ~ x1 + x2 | 1

# zeros modelled with the same variables (the default)
y ~ x1 + x2 | x1 + x2

# zeros modelled with different variables
y ~ x1 + x2 | z1 + z2

Checking for zero-inflation

hist(cuckoo$Beg)

100*sum(cuckoo$Beg== 0)/nrow(cuckoo)
[1] 25.4902
# 25% of data is zeros
performance::check_zeroinflation(cuckoo_glm2)
# Check for zero-inflation

   Observed zeros: 13
  Predicted zeros: 0
            Ratio: 0.00

Fit the model

Fit the model

cuckoo_zip <- zeroinfl(Beg ~ mass.c + Species + mass.c:Species| 
                   mass.c + Species + mass.c:Species,
                dist = "poisson",
                link = "logit",
                data = cuckoo)

summary(cuckoo_zip)

Call:
zeroinfl(formula = Beg ~ mass.c + Species + mass.c:Species | mass.c + 
    Species + mass.c:Species, data = cuckoo, dist = "poisson", link = "logit")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-3.2611 -0.9164 -0.1979  0.8878  3.7917 

Count model coefficients (poisson with log link):
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)            3.395981   0.104443  32.515  < 2e-16 ***
mass.c                 0.027481   0.008805   3.121 0.001802 ** 
SpeciesWarbler        -0.800911   0.117996  -6.788 1.14e-11 ***
mass.c:SpeciesWarbler  0.038046   0.009819   3.875 0.000107 ***

Zero-inflation model coefficients (binomial with logit link):
                      Estimate Std. Error z value Pr(>|z|)  
(Intercept)             0.8893     0.7431   1.197   0.2314  
mass.c                 -0.2154     0.1078  -1.998   0.0457 *
SpeciesWarbler         -8.6156     4.1849  -2.059   0.0395 *
mass.c:SpeciesWarbler  -0.3357     0.3483  -0.964   0.3352  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 18 
Log-likelihood: -155.6 on 8 Df

Check for zero-inflated negbin

cuckoo_zinb <- zeroinfl(Beg ~ mass.c + Species + mass.c:Species| 
                   mass.c + Species + mass.c:Species,
                dist = "negbin",
                link = "logit",
                data = cuckoo)

summary(cuckoo_zinb)

Call:
zeroinfl(formula = Beg ~ mass.c + Species + mass.c:Species | mass.c + 
    Species + mass.c:Species, data = cuckoo, dist = "negbin", link = "logit")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.9866 -0.5772 -0.1885  0.6021  2.2198 

Count model coefficients (negbin with log link):
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)            3.38815    0.22489  15.066  < 2e-16 ***
mass.c                 0.02831    0.02051   1.381 0.167410    
SpeciesWarbler        -0.79777    0.23905  -3.337 0.000846 ***
mass.c:SpeciesWarbler  0.03786    0.02175   1.740 0.081793 .  
Log(theta)             2.27765    0.43523   5.233 1.67e-07 ***

Zero-inflation model coefficients (binomial with logit link):
                      Estimate Std. Error z value Pr(>|z|)  
(Intercept)             0.8893     0.7431   1.197   0.2314  
mass.c                 -0.2154     0.1078  -1.998   0.0457 *
SpeciesWarbler         -8.6653     4.2549  -2.037   0.0417 *
mass.c:SpeciesWarbler  -0.3374     0.3530  -0.956   0.3391  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta = 9.7537 
Number of iterations in BFGS optimization: 33 
Log-likelihood: -143.9 on 9 Df

Model validation

Model validation is not straightforward for zero-inflated models.

Neither the performance package nor DHARMa will produce outputs for these models at this time.

We have to check the residuals manually:

sresid <- residuals(cuckoo_zip, type = "pearson")

pred <- predict(cuckoo_zip)

sresidnb <- residuals(cuckoo_zinb, type = "pearson")

prednb <- predict(cuckoo_zinb)

Zero-inflated poisson

par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))

hist(sresid)
plot(sresid ~ pred)
qqnorm(sresid)
qqline(sresid, col = "red")

Zero-inflated negative binomial

par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))

hist(sresidnb)
plot(sresidnb ~ prednb)
qqnorm(sresidnb)
qqline(sresidnb, col = "red")

Model comparisons

How do we know which is the best model for our data?

Do we really NEED zero-inflation? Check against the negative binomial.

Here I am comparing:

AIC(cuckoo_glm1, 
    cuckoo_nb, 
    cuckoo_zip, 
    cuckoo_zinb)
            df      AIC
cuckoo_glm1  4 615.8282
cuckoo_nb    5 357.2955
cuckoo_zip   8 327.2635
cuckoo_zinb  9 305.8751

Choosing Zero-inflation models

  • Check for overdispersion

  • Is your model underfitting zeros?

  • AND do zeros make up a substantial part of your data?

  • Do you think there might be a different process controlling 0/1 from count?

  • You probably want to compare zip and zinb for best fit.

Challenge - Zeroinflation

20:00

Slugs



  • Using the revegetation dataset

  • Check the model fit

  • Determine if zero-inflation is required

  • Validate the model

Linear Mixed Models

Why mixed model?

The key feature of a linear mixed-effects model is the inclusion of random effects - variables where our observations are grouped into subcategories that systematically affect our outcome - to account for important structure in our data.

Mixed-effects models can be used in many situations, in place of more straightforward tests, when this structure may be important:

  1. mixed-effects models incorporate group and even individual-level differences

  2. mixed-effects models cope well with missing data, unequal group sizes and repeated measurements

Examples of structured data

Hierarchy

  • Estimating scores on a standardised maths test - students from the same maths class should be grouped

Repeated measures

  • Medical interventions on blood sugar levels - each person is measured multiple times over a year

Fixed and Random Effects

Fixed Effect

Fixed effects are variables that we expect will affect the dependent/response variable: they’re what you call explanatory variables in a standard linear regression.

\(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\)

lm(y ~ x, data = data)

Random Effect

A random effect is a parameter that is allowed to vary across groups or individuals. Random effects do not take a single fixed value, rather they follow a distribution (usually the normal distribution) and allow a model to account for variation around an intercept or slope.

\(y_i = \beta_{0[i]} + \beta_1 x_i + \epsilon_i\)

lmer(y ~ x + (1|group), data = data)

No pooling

Partial pooling

Shrinkage

Model in R

lmer1 <- lmer(y ~ x + (1|group), data = data)
summary(lmer1)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: y ~ x + (1 | group)
   Data: data

REML criterion at convergence: 3224.4

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.61968 -0.63654 -0.03584  0.66113  3.13597 

Random effects:
 Groups   Name        Variance Std.Dev.
 group    (Intercept) 205      14.32   
 Residual             101      10.05   
Number of obs: 430, groups:  group, 5

Fixed effects:
            Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)  23.2692     6.4818   4.1570    3.59   0.0215 *  
x             2.0271     0.1703 424.0815   11.90   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
  (Intr)
x -0.131

Group: the variation associated with the grouping factor (after the fixed effects are fitted)

Residual: the residual/leftover variance not due to the fixed or random effects

The fixed effects are similar in interpretation to those from models you have constructed before.

However, p-values for mixed models aren’t as straightforward as they are for the linear model. There are multiple approaches, with an ongoing discussion and sometimes wildly differing opinions about which is best.

The lmerTest package provides p-values for lmer model fits via Satterthwaite’s degrees-of-freedom method.

Fixed effects:
            Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)  23.2692     6.4818   4.1570    3.59   0.0215 *  
x             2.0271     0.1703 424.0815   11.90   <2e-16 ***

Random slopes and intercepts

\(y_i = \beta_{0[i]} + \beta_1 x_i + \epsilon_i\)

lmer(y ~ x + (1|group), data = data)

\(y_i = \beta_{0[i]} + \beta_{1[i]} x_i + \epsilon_i\)

lmer(y ~ x + (x|group), data = data)

Checking models

Checking models with random effects is more complex.

check_model() (from the performance package) will add plots to investigate the distribution of the random effects.

With increasingly complex models, DHARMa’s simulated residuals may be preferred:

Object of Class DHARMa with simulated residuals based on 250 simulations with refit = FALSE . See ?DHARMa::simulateResiduals for help. 
 
Scaled residual values: 0.036 0.112 0.808 0.22 0.364 0.72 0.656 0.48 0.464 0.172 0.908 0.604 0.228 0.276 0.94 0.428 0.164 0.372 1 0.58 ...

Comparing models

Comparisons of random effects may be checked with the Likelihood Ratio Test (LRT) and a \(\chi^2\) distribution

anova(random_slope_intercept, random_intercept)
Data: data
Models:
lmer1: y ~ x + (1 | group)
lmer3: y ~ x + (x | group)
      npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)   
lmer1    4 3236.1 3252.3 -1614.0   3228.1                        
lmer3    6 3228.4 3252.8 -1608.2   3216.4 11.649  2   0.002954 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Fixed effects

Fixed effects may be tested with F-tests (via Satterthwaite’s method when using lmerTest):

anova(lmer1)
drop1(lmer1, test = "F")
Single term deletions using Satterthwaite's method:

Model:
y ~ x + (1 | group)
  Sum Sq Mean Sq NumDF  DenDF F value    Pr(>F)    
x  14303   14303     1 424.08  141.61 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Choosing your model

  • Explore your data and understand it

  • Think carefully before deciding on the structure of your model

  • Often better to ‘keep it maximal’ with regards to random effects (if you can) - don’t be guided solely by LRT.

Issues

Convergence warning

The model may need simplifying, or refitting with an alternative optimiser

lmer3 <- lmer(y ~ x + (x | group), data = data)
Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.00241 (tol = 0.002, component 1)

Boundary fit

Your model did fit, but the warning means some of your random-effect variances were estimated at (or very near) zero

Reporting models

  1. Describe and justify your model (summarise the model fit).
  • Specify all fixed and random effects, interactions, intercepts, and slopes
  • Specify the sample size: number of observations, and number of levels for each grouping factor
  2. Mention that you checked assumptions (and how), and that the assumptions are satisfied.

  3. Fixed effects: report significance (test statistic (t/F), degrees of freedom, p-value) and the actual coefficient estimates (effect size: magnitude and direction, standard errors, and confidence intervals). Report post-hoc tests (emmeans()).

  4. Random effects: report the variances and standard deviations.

Wrap-up

Where next?



  • (Generalised) Linear Mixed Models
  • Generalised Additive Models

Additional resources

  • GLMs resource list:

  • Discovering Statistics - Andy Field

  • An Introduction to Generalized Linear Models - Dobson & Barnett

  • An Introduction to Statistical Learning with Applications in R - James, Witten, Hastie & Tibshirani

  • Mixed Effects Models and Extensions in Ecology with R - Zuur, et al.

  • Ecological Statistics with contemporary theory and application

  • The Big Book of R (https://www.bigbookofr.com/)